PREFACE
to the first edition 


This introduction to numerical analysis was written for students in mathematics, 
the physical sciences, and engineering, at the upper undergraduate to beginning 
graduate level. Prerequisites for using the text are elementary calculus, linear 
algebra, and an introduction to differential equations. The student’s level of 
mathematical maturity or experience with mathematics should be somewhat 
higher; I have found that most students do not attain the necessary level until 
their senior year. Finally, the student should have a knowledge of computer 
programming. The preferred language for most scientific programming is Fortran.
A truly effective use of numerical analysis in applications requires both a 
theoretical knowledge of the subject and computational experience with it. The 
theoretical knowledge should include an understanding of both the original 
problem being solved and of the numerical methods for its solution, including 
their derivation, error analysis, and an idea of when they will perform well or 
poorly. This kind of knowledge is necessary even if you are only considering 
using a package program from your computer center. You must still understand 
the program’s purpose and limitations to know whether it applies to your 
particular situation or not. More importantly, a majority of problems cannot be 
solved by the simple application of a standard program. For such problems you 
must devise new numerical methods, and this is usually done by adapting 
standard numerical methods to the new situation. This requires a good theoreti- 
cal foundation in numerical analysis, both to devise the new methods and to 
avoid certain numerical pitfalls that occur easily in a number of problem areas. 
Computational experience is also very important. It gives a sense of reality to 
most theoretical discussions; and it brings out the important difference between 
the exact arithmetic implicit in most theoretical discussions and the finite-length 
arithmetic computation, whether on a computer or a hand calculator. The use of 
a computer also imposes constraints on the structure of numerical methods, 
constraints that are not evident and that seem unnecessary from a strictly 
mathematical viewpoint. For example, iterative procedures are often preferred 
over direct procedures because of simpler programming requirements or com- 
puter memory size limitations, even though the direct procedure may seem 
simpler to explain and to use. Many numerical examples are given in this text to 
illustrate these points, and there are a number of exercises that will give the 
student a variety of computational experience. 
The book is organized in a fairly standard manner. Topics that are simpler,
both theoretically and computationally, come first; for example, rootfinding for a 
single nonlinear equation is covered in Chapter 2. The more sophisticated topics 
within numerical linear algebra are left until the last three chapters. If an 
instructor prefers, however, Chapters 7 through 9 on numerical linear algebra can 
be inserted at any point following Chapter 1. Chapter 1 contains a number of 
introductory topics, some of which the instructor may wish to postpone until
later in the course. It is important, however, to cover the mathematical and 
notational preliminaries of Section 1.1 and the introduction to computer 
floating-point arithmetic given in Section 1.2 and in part of Section 1.3. 

The text contains more than enough material for a one-year course. In 
addition, introductions are given to some topics that instructors may wish to 
expand on from their own notes. For example, a brief introduction is given to 
stiff differential equations in the last part of Section 6.8 in Chapter 6; and some 
theoretical foundation for the least squares data-fitting problem is given in 
Theorem 7.5 and Problem 15 of Chapter 7. These can easily be expanded by 
using the references given in the respective chapters. 

Each chapter contains a discussion of the research literature and a bibliogra- 
phy of some of the important books and papers on the material of the chapter. 
The chapters all conclude with a set of exercises. Some of these exercises are 
illustrations or applications of the text material, and others involve the develop- 
ment of new material. As an aid to the student, answers and hints to selected 
exercises are given at the end of the book. It is important, however, for students 
to solve some problems in which there is no given answer against which they can 
check their results. This forces them to develop a variety of other means for 
checking their own work; and it will force them to develop some common sense 
or judgment as an aid in knowing whether or not their results are reasonable. 

I teach a one-year course covering much of the material of this book. Chapters 
1 through 5 form the first semester, and Chapters 6 through 9 form the second 
semester. In most chapters, a number of topics can be deleted without any
difficulty arising in later chapters. Exceptions to this are Section 2.5 on linear
iteration methods, Sections 3.1 to 3.3 and 3.6 on interpolation theory, Section 4.4 on
orthogonal polynomials, and Section 5.1 on the trapezoidal and Simpson integration
rules.

I thank Professor Herb Hethcote of the University of Iowa for his helpful 
advice and for having taught from an earlier rough draft of the book. I am also 
grateful for the advice of Professors Robert Barnhill, University of Utah, Herman 
Burchard, Oklahoma State University, and Robert J. Flynn, Polytechnic Institute 
of New York. I am very grateful to Ada Burns and Lois Friday, who did an 
excellent job of typing this and earlier versions of the book. I thank the many 
students who, over the past twelve years, enrolled in my course and used my 
notes and rough drafts rather than a regular text. They pointed out numerous 
errors, and their difficulties with certain topics helped me in preparing better 
presentations of them. The staff of John Wiley have been very helpful, and the 
text is much better as a result of their efforts. Finally, I thank my wife Alice for 
her patient and encouraging support, without which the book would probably 
have not been completed. 


Iowa City, August, 1978 Kendall E. Atkinson 
ONE


ERROR: 

ITS SOURCES, 
PROPAGATION, 
AND ANALYSIS 


The subject of numerical analysis provides computational methods for the study 
and solution of mathematical problems. In this text we study numerical methods 
for the solution of the most common mathematical problems and we analyze the 
errors present in these methods. Because almost all computation is now done on 
digital computers, we also discuss the implications of this in the implementation 
of numerical methods. 

The study of error is a central concern of numerical analysis. Most numerical 
methods give answers that are only approximations to the desired true solution, 
and it is important to understand and to be able, if possible, to estimate or bound
the resulting error. This chapter examines the various kinds of errors that may 
occur in a problem. The representation of numbers in computers is examined, 
along with the error in computer arithmetic. General results on the propagation 
of errors in calculations are given, with a detailed look at error in summation 
procedures. Finally, the concepts of stability and conditioning of problems and 
numerical methods are introduced and illustrated. The first section contains 
mathematical preliminaries needed for the work of later chapters. 


1.1 Mathematical Preliminaries 


This section contains a review of results from calculus, which will be used in this 
text. We first give some mean value theorems, and then we present and discuss 
Taylor’s theorem, for functions of one and two variables. The section concludes 
with some notation that will be used in later chapters. 


Theorem 1.1 (Intermediate Value) Let $f(x)$ be continuous on the finite interval
$a \le x \le b$, and define

$$ m = \inf_{a \le x \le b} f(x), \qquad M = \sup_{a \le x \le b} f(x) $$

Then for any number $\zeta$ in the interval $[m, M]$, there is at least one
point $\xi$ in $[a, b]$ for which

$$ f(\xi) = \zeta $$

In particular, there are points $\underline{x}$ and $\overline{x}$ in $[a, b]$ for which

$$ m = f(\underline{x}), \qquad M = f(\overline{x}) $$


Theorem 1.2 (Mean Value) Let $f(x)$ be continuous for $a \le x \le b$, and let it be
differentiable for $a < x < b$. Then there is at least one point $\xi$ in
$(a, b)$ for which

$$ f(b) - f(a) = f'(\xi)(b - a) $$


Theorem 1.3 (Integral Mean Value) Let $w(x)$ be nonnegative and integrable on
$[a, b]$, and let $f(x)$ be continuous on $[a, b]$. Then

$$ \int_a^b w(x) f(x)\,dx = f(\xi) \int_a^b w(x)\,dx $$

for some $\xi \in [a, b]$.

These theorems are discussed in most elementary calculus textbooks, and thus 
we omit their proofs. Some implications of these theorems are examined in the 
problems at the end of the chapter. 

One of the most important tools of numerical analysis is Taylor’s theorem and 
the associated Taylor series. It is used throughout this text. The theorem gives a 
relatively simple method for approximating functions f(x) by polynomials, and 
thereby gives a method for computing f(x). 


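Theorem 1.4 (Taylor's Theorem) Let $f(x)$ have $n + 1$ continuous derivatives
on $[a, b]$ for some $n \ge 0$, and let $x, x_0 \in [a, b]$. Then

$$ f(x) = p_n(x) + R_{n+1}(x) \qquad (1.1.1) $$

$$ p_n(x) = f(x_0) + \frac{(x - x_0)}{1!} f'(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0) \qquad (1.1.2) $$

$$ R_{n+1}(x) = \frac{1}{n!} \int_{x_0}^{x} (x - t)^n f^{(n+1)}(t)\,dt = \frac{(x - x_0)^{n+1}}{(n+1)!} f^{(n+1)}(\xi) \qquad (1.1.3) $$

for some $\xi$ between $x_0$ and $x$.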


Proof The derivation of (1.1.1) is given in most calculus texts. It uses carefully
chosen integration by parts in the identity

$$ f(x) = f(x_0) + \int_{x_0}^{x} f'(t)\,dt $$

repeating it $n$ times to obtain (1.1.1)-(1.1.3), with the integral form of
the remainder $R_{n+1}(x)$. The second form of $R_{n+1}(x)$ is obtained by
using the integral mean value theorem with $w(t) = (x - t)^n$. ∎

Using Taylor’s theorem, we obtain the following standard formulas: 
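
$$ e^x = 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!} + \frac{x^{n+1}}{(n+1)!} e^{\xi_x} \qquad (1.1.4) $$

$$ \cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots + (-1)^n \frac{x^{2n}}{(2n)!} + (-1)^{n+1} \frac{x^{2n+2}}{(2n+2)!} \cos(\xi_x) \qquad (1.1.5) $$

$$ \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots + (-1)^{n-1} \frac{x^{2n-1}}{(2n-1)!} + (-1)^n \frac{x^{2n+1}}{(2n+1)!} \cos(\xi_x) \qquad (1.1.6) $$

$$ (1 + x)^\alpha = 1 + \binom{\alpha}{1} x + \binom{\alpha}{2} x^2 + \cdots + \binom{\alpha}{n} x^n + \binom{\alpha}{n+1} \frac{x^{n+1}}{(1 + \xi_x)^{n+1-\alpha}} \qquad (1.1.7) $$

with

$$ \binom{\alpha}{k} = \frac{\alpha(\alpha - 1) \cdots (\alpha - k + 1)}{k!}, \qquad k = 1, 2, 3, \ldots $$

for any real number $\alpha$. For all cases, the unknown point $\xi_x$ is located between $x$
and 0.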
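
As a rough numerical check of (1.1.4), the following Python sketch evaluates the Taylor polynomial of e^x about 0 and compares the actual error with a bound taken from the remainder term. The function names are illustrative only, and the bound uses the fact that the unknown point lies between 0 and x, so its exponential is at most e^{max(x, 0)}.

```python
import math

def exp_taylor(x, n):
    """Taylor polynomial p_n(x) of e^x about 0, as in (1.1.4)."""
    term, total = 1.0, 1.0
    for k in range(1, n + 1):
        term *= x / k          # term is now x^k / k!
        total += term
    return total

def exp_remainder_bound(x, n):
    """Bound on |R_{n+1}(x)|: |x|^(n+1)/(n+1)! times e^max(x, 0)."""
    return abs(x) ** (n + 1) / math.factorial(n + 1) * math.exp(max(x, 0.0))

x, n = 0.5, 6
actual_error = abs(math.exp(x) - exp_taylor(x, n))
print(actual_error, exp_remainder_bound(x, n))
```

The printed error should not exceed the printed bound; increasing n shrinks both rapidly for moderate |x|.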
An important special case of (1.1.7) is 


$$ \frac{1}{1 - x} = 1 + x + x^2 + \cdots + x^n + \frac{x^{n+1}}{1 - x}, \qquad x \ne 1 \qquad (1.1.8) $$


This is the case a = —1, with x replaced by —x. The remainder has a simpler 
form than in (1.1.7); it is easily proved by multiplying both sides of (1.1.8) by 
1 — x and then simplifying. Rearranging (1.1.8), we obtain the familiar formula 
for a finite geometric series: 


$$ 1 + x + x^2 + \cdots + x^n = \frac{1 - x^{n+1}}{1 - x}, \qquad x \ne 1 \qquad (1.1.9) $$

Infinite series representations for the functions on the left side of (1.1.4) to
(1.1.8) can be obtained by letting $n \to \infty$. The infinite series for (1.1.4) to (1.1.6)
converge for all $x$, and those for (1.1.7) and (1.1.8) converge for $|x| < 1$.
Formula (1.1.8) leads to the well-known infinite geometric series

$$ \frac{1}{1 - x} = \sum_{k=0}^{\infty} x^k, \qquad |x| < 1 \qquad (1.1.10) $$


The Taylor series of any sufficiently differentiable function f(x) can be
calculated directly from the definition (1.1.2), with as many terms included as 
desired. But because of the complexity of the differentiation of many functions 
f(x), it is often better to obtain indirectly their Taylor polynomial approxima- 
tions p,(x) or their Taylor series, by using one of the preceding formulas (1.1.4) 
through (1.1.8). We give three examples, all of which have simpler error terms 
than if (1.1.3) were used directly. 


Example 1. Let $f(x) = e^{-x^2}$. Replace $x$ by $-x^2$ in (1.1.4) to obtain

$$ e^{-x^2} = 1 - x^2 + \frac{x^4}{2!} - \cdots + (-1)^n \frac{x^{2n}}{n!} + (-1)^{n+1} \frac{x^{2n+2}}{(n+1)!} e^{\xi_x} $$

with $-x^2 \le \xi_x \le 0$.


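2. Let $f(x) = \tan^{-1}(x)$. Begin by setting $x = -u^2$ in (1.1.8):

$$ \frac{1}{1 + u^2} = 1 - u^2 + u^4 - \cdots + (-1)^n u^{2n} + (-1)^{n+1} \frac{u^{2n+2}}{1 + u^2} $$

Integrate over $[0, x]$ to get

$$ \tan^{-1}(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \cdots + (-1)^n \frac{x^{2n+1}}{2n+1} + (-1)^{n+1} \int_0^x \frac{u^{2n+2}}{1 + u^2}\,du \qquad (1.1.11) $$

Applying the integral mean value theorem,

$$ \int_0^x \frac{u^{2n+2}}{1 + u^2}\,du = \frac{x^{2n+3}}{2n+3} \cdot \frac{1}{1 + \xi_x^2} $$

with $\xi_x$ between 0 and $x$.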
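
The remainder in (1.1.11) gives a simple error bound, since the factor 1/(1 + ξ²) is at most 1. The sketch below (function names are illustrative assumptions) compares the partial sum of the arctangent series with Python's `math.atan` and with that bound.

```python
import math

def arctan_partial_sum(x, n):
    """x - x^3/3 + ... + (-1)^n x^(2n+1)/(2n+1), the polynomial part of (1.1.11)."""
    return sum((-1) ** j * x ** (2 * j + 1) / (2 * j + 1) for j in range(n + 1))

def arctan_error_bound(x, n):
    """|x|^(2n+3)/(2n+3), obtained from the remainder by using 1/(1 + xi^2) <= 1."""
    return abs(x) ** (2 * n + 3) / (2 * n + 3)

for x, n in ((0.5, 4), (1.0, 100)):
    err = abs(math.atan(x) - arctan_partial_sum(x, n))
    print(x, n, err, arctan_error_bound(x, n))
```

For x = 1 the bound decays only like 1/(2n + 3), which illustrates how slowly the series converges at the edge of its interval of convergence.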
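3. Let $f(x) = \int_0^1 \sin(xt)\,dt$. Using (1.1.6),

$$ f(x) = \int_0^1 \left[ xt - \frac{x^3 t^3}{3!} + \cdots + (-1)^{n-1} \frac{(xt)^{2n-1}}{(2n-1)!} + (-1)^n \frac{(xt)^{2n+1}}{(2n+1)!} \cos(\xi_{xt}) \right] dt $$

$$ = \sum_{j=1}^{n} (-1)^{j-1} \frac{x^{2j-1}}{(2j-1)!\,(2j)} + (-1)^n \frac{x^{2n+1}}{(2n+1)!} \int_0^1 t^{2n+1} \cos(\xi_{xt})\,dt $$

with $\xi_{xt}$ between 0 and $xt$. The integral in the remainder is easily bounded by
$1/(2n + 2)$; but we can also convert it to a simpler form. Although it wasn't
proved, it can be shown that $\cos(\xi_{xt})$ is a continuous function of $t$. Then
applying the integral mean value theorem,

$$ \int_0^1 \sin(xt)\,dt = \sum_{j=1}^{n} (-1)^{j-1} \frac{x^{2j-1}}{(2j-1)!\,(2j)} + (-1)^n \frac{x^{2n+1}}{(2n+1)!\,(2n+2)} \cos(\zeta_x) $$

for some $\zeta_x$ between 0 and $x$.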


Taylor’s theorem in two dimensions Let f(x, y) be a given function of the two 
independent variables x and y. We will show how the earlier Taylor’s theorem 
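can be extended to the expansion of $f(x, y)$ about a given point $(x_0, y_0)$. The
results will easily extend to functions of more than two variables. As notation, let
$L(x_0, y_0; x_1, y_1)$ denote the set of all points $(x, y)$ on the straight line segment
joining $(x_0, y_0)$ and $(x_1, y_1)$.

Theorem 1.5 Let $(x_0, y_0)$ and $(x_0 + \xi, y_0 + \eta)$ be given points, and assume
$f(x, y)$ is $n + 1$ times continuously differentiable for all $(x, y)$ in
some neighborhood of $L(x_0, y_0; x_0 + \xi, y_0 + \eta)$. Then

$$ f(x_0 + \xi, y_0 + \eta) = f(x_0, y_0) + \sum_{j=1}^{n} \frac{1}{j!} \left[ \xi \frac{\partial}{\partial x} + \eta \frac{\partial}{\partial y} \right]^j f(x, y) \Big|_{x = x_0,\ y = y_0} + \frac{1}{(n+1)!} \left[ \xi \frac{\partial}{\partial x} + \eta \frac{\partial}{\partial y} \right]^{n+1} f(x, y) \Big|_{x = x_0 + \theta\xi,\ y = y_0 + \theta\eta} \qquad (1.1.12) $$

for some $0 \le \theta \le 1$. The point $(x_0 + \theta\xi, y_0 + \theta\eta)$ is an unknown
point on the line $L(x_0, y_0; x_0 + \xi, y_0 + \eta)$.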
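Proof First note the meaning of the derivative notation in (1.1.12). As an
example

$$ \left[ \xi \frac{\partial}{\partial x} + \eta \frac{\partial}{\partial y} \right]^2 f(x, y) = \xi^2 \frac{\partial^2 f(x, y)}{\partial x^2} + 2\xi\eta \frac{\partial^2 f(x, y)}{\partial x\,\partial y} + \eta^2 \frac{\partial^2 f(x, y)}{\partial y^2} $$

The subscript notation, $x = x_0$, $y = y_0$, means the various derivatives
are to be evaluated at $(x_0, y_0)$.

The proof of (1.1.12) is based on applying the earlier Taylor's theorem
to

$$ F(t) = f(x_0 + t\xi, y_0 + t\eta), \qquad 0 \le t \le 1 $$

Using Theorem 1.4,

$$ F(1) = F(0) + \frac{F'(0)}{1!} + \cdots + \frac{F^{(n)}(0)}{n!} + \frac{F^{(n+1)}(\theta)}{(n+1)!} $$

for some $0 \le \theta \le 1$. Clearly, $F(0) = f(x_0, y_0)$, $F(1) = f(x_0 + \xi, y_0 + \eta)$.
For the first derivative,

$$ F'(t) = \xi \frac{\partial f(x_0 + t\xi, y_0 + t\eta)}{\partial x} + \eta \frac{\partial f(x_0 + t\xi, y_0 + t\eta)}{\partial y} = \left[ \xi \frac{\partial}{\partial x} + \eta \frac{\partial}{\partial y} \right] f(x, y) \Big|_{x = x_0 + t\xi,\ y = y_0 + t\eta} $$

The higher order derivatives are calculated similarly. ∎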


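Example As a simple example, consider expanding $f(x, y) = x/y$ about
$(x_0, y_0) = (6, 2)$. Let $n = 1$. Then

$$ \frac{x}{y} = f(6, 2) + (x - 6) \frac{\partial f(6, 2)}{\partial x} + (y - 2) \frac{\partial f(6, 2)}{\partial y} + \frac{1}{2} \left[ (x - 6)^2 \frac{\partial^2 f(\delta, \gamma)}{\partial x^2} + 2(x - 6)(y - 2) \frac{\partial^2 f(\delta, \gamma)}{\partial x\,\partial y} + (y - 2)^2 \frac{\partial^2 f(\delta, \gamma)}{\partial y^2} \right] $$

$$ = 3 + \frac{1}{2}(x - 6) - \frac{3}{2}(y - 2) + \frac{1}{2} \left[ (x - 6)^2 \cdot 0 - 2(x - 6)(y - 2) \frac{1}{\gamma^2} + (y - 2)^2 \frac{2\delta}{\gamma^3} \right] $$

with $(\delta, \gamma)$ a point on $L(6, 2; x, y)$. For $(x, y)$ close to $(6, 2)$,

$$ \frac{x}{y} \approx 3 + \frac{1}{2} x - \frac{3}{2} y $$

The graph of $z = 3 + \frac{1}{2}x - \frac{3}{2}y$ is the plane tangent to the graph of $z = x/y$ at
$(x, y, z) = (6, 2, 3)$.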
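
The quality of this linear approximation can be checked numerically. In the sketch below (names are illustrative), the point (6 + h, 2 + h) is moved toward (6, 2); from the second-order remainder, the error should behave roughly like h²/2 along this direction.

```python
def f(x, y):
    return x / y

def tangent_plane(x, y):
    """Linear Taylor polynomial of x/y about (6, 2): 3 + x/2 - 3y/2."""
    return 3.0 + 0.5 * x - 1.5 * y

def plane_error(h):
    """Error of the tangent plane at the point (6 + h, 2 + h)."""
    return abs(f(6.0 + h, 2.0 + h) - tangent_plane(6.0 + h, 2.0 + h))

for h in (0.1, 0.01, 0.001):
    print(h, plane_error(h), h * h / 2)
```

Reducing h by a factor of 10 should reduce the error by about a factor of 100, as expected for a remainder that is quadratic in the displacement.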


Some mathematical notation There are several concepts that are taken up in 
this text that are needed in a simpler form in the earlier chapters. These include 
results on divided differences of functions, vector spaces, and vector and matrix 
norms. The minimum necessary notation is introduced at this point, and a more 
complete development is left to other more natural places in the text. 

For a given function f(x), define 


$$ f[x_0, x_1] = \frac{f(x_1) - f(x_0)}{x_1 - x_0}, \qquad f[x_0, x_1, x_2] = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} \qquad (1.1.13) $$

assuming $x_0, x_1, x_2$ are distinct. These are called the first- and second-order
divided differences of $f(x)$, respectively. They are related to derivatives of $f(x)$:

$$ f[x_0, x_1] = f'(\xi), \qquad f[x_0, x_1, x_2] = \tfrac{1}{2} f''(\zeta) \qquad (1.1.14) $$

with $\xi$ between $x_0$ and $x_1$, and $\zeta$ between the minimum and maximum of $x_0$, $x_1$,
and $x_2$. The divided differences are independent of the order of their arguments,
contrary to what might be expected from (1.1.13). More precisely,

$$ f[x_0, x_1] = f[x_1, x_0] $$
$$ f[x_0, x_1, x_2] = f[x_i, x_j, x_k] \qquad (1.1.15) $$


for any permutation (i, j, k) of (0,1, 2). The proofs of these and other properties 
are left as problems. A complete development of divided differences is given in 
Section 3.2 of Chapter 3. 
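
The definitions (1.1.13) translate directly into code. The following sketch (the helper names `dd1` and `dd2` are assumptions made for illustration) computes first- and second-order divided differences of e^x and checks them against the derivative relations (1.1.14) and the symmetry property (1.1.15).

```python
import math

def dd1(f, x0, x1):
    """First-order divided difference f[x0, x1]."""
    return (f(x1) - f(x0)) / (x1 - x0)

def dd2(f, x0, x1, x2):
    """Second-order divided difference f[x0, x1, x2]."""
    return (dd1(f, x1, x2) - dd1(f, x0, x1)) / (x2 - x0)

x0, x1, x2 = 0.0, 0.1, 0.3
first = dd1(math.exp, x0, x1)        # equals f'(xi) for some xi in [0, 0.1]
second = dd2(math.exp, x0, x1, x2)   # equals f''(zeta)/2 for some zeta in [0, 0.3]
permuted = dd2(math.exp, x2, x0, x1) # same nodes in a different order

print(first, 2 * second, abs(second - permuted))
```

Because every derivative of e^x is e^x, the first value must lie between e^0 and e^0.1, and twice the second value must lie between e^0 and e^0.3; the last printed number should be at rounding-error level.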

The subjects of vector spaces, matrices, and vector and matrix norms are 
covered in Chapter 7, immediately preceding the chapters on numerical linear 
algebra. We introduce some of this material here, while leaving the proofs till 
Chapter 7. Two vector spaces are used in a great many applications. They are 


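$$ \mathbf{R}^n = \left\{ x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \ \middle|\ x_1, \ldots, x_n \text{ real numbers} \right\} $$

$$ C[a, b] = \{ f(t) \mid f(t) \text{ continuous and real valued},\ a \le t \le b \} $$

For $x, y \in \mathbf{R}^n$ and $\alpha$ a real number, define $x + y$ and $\alpha x$ by

$$ x + y = \begin{pmatrix} x_1 + y_1 \\ \vdots \\ x_n + y_n \end{pmatrix}, \qquad \alpha x = \begin{pmatrix} \alpha x_1 \\ \vdots \\ \alpha x_n \end{pmatrix} $$

For $f, g \in C[a, b]$ and $\alpha$ a real number, define $f + g$ and $\alpha f$ by

$$ (f + g)(t) = f(t) + g(t), \qquad (\alpha f)(t) = \alpha f(t), \qquad a \le t \le b $$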


Vector norms are used to measure the magnitude of a vector. For R”, we 
define initially two different norms: 


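$$ \|x\|_\infty = \max_{1 \le i \le n} |x_i|, \qquad x \in \mathbf{R}^n \qquad (1.1.16) $$

$$ \|x\|_2 = \sqrt{x_1^2 + \cdots + x_n^2}, \qquad x \in \mathbf{R}^n \qquad (1.1.17) $$

For $C[a, b]$, define

$$ \|f\|_\infty = \max_{a \le t \le b} |f(t)|, \qquad f \in C[a, b] \qquad (1.1.18) $$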


These definitions can be shown to satisfy the following three characteristic 
properties of all norms. 


1. $\|v\| \ge 0$; $\|v\| = 0$ if and only if $v = 0$, the zero vector
2. $\|\alpha v\| = |\alpha| \cdot \|v\|$, for all vectors $v$ and real numbers $\alpha$
3. $\|v + w\| \le \|v\| + \|w\|$, for all vectors $v$ and $w$


Property (3) is usually referred to as the triangle inequality. An explanation of
this name and a further development of properties and norms for $C[a, b]$ is given
in Chapter 4.

Norms can also be introduced for matrices. For an n X n matrix 


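$$ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} $$

define

$$ \|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| \qquad (1.1.19) $$

With this definition, the properties of a vector norm will be satisfied. In addition,
it can be shown that

$$ \|AB\|_\infty \le \|A\|_\infty \|B\|_\infty \qquad (1.1.20) $$

$$ \|Ax\|_\infty \le \|A\|_\infty \|x\|_\infty, \qquad \text{all } x \in \mathbf{R}^n \qquad (1.1.21) $$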


where A and B are arbitrary n X n matrices. The proofs of these results are left 
to the problems. They are also taken up in greater generality in Chapter 7. 
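
The norms (1.1.16) and (1.1.19) and the inequalities (1.1.20) and (1.1.21) can be tried out with a few lines of Python. The matrices and vector below are arbitrary illustrative choices, not data from the text, and the helper names are assumptions.

```python
def vec_norm_inf(x):
    """Infinity norm of a vector, as in (1.1.16)."""
    return max(abs(xi) for xi in x)

def mat_norm_inf(A):
    """Maximum absolute row sum of a matrix, as in (1.1.19)."""
    return max(sum(abs(a) for a in row) for row in A)

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical 2 x 2 data chosen only for illustration.
A = [[1, -2], [3, 4]]
B = [[0, 1], [-1, 2]]
x = [1, -1]

print(mat_norm_inf(matmul(A, B)), mat_norm_inf(A) * mat_norm_inf(B))
print(vec_norm_inf(matvec(A, x)), mat_norm_inf(A) * vec_norm_inf(x))
```

In each printed pair the first number should not exceed the second.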


Example Consider the vector space R? and matrices of order 2 x 2. In particu- 
lar, let 


Then 
l4ln°5 ‘Plea 2 tye? 


and (1.1.21) is easily satisfied. To show that (1.1.21) cannot be improved upon,
take 


Then 


$$ \|Ax\|_\infty = 5 = \|A\|_\infty \|x\|_\infty $$


1.2 Computer Representation of Numbers 


Digital computers are the principal means of calculation in numerical analysis, 
and consequently it is very important to understand how they operate. In this 
section we consider how numbers are represented in computers, and in the 
remaining sections we consider some consequences of the computer representa- 
tion of numbers and of computer arithmetic. 

Most computers have an integer mode and a floating-point mode for repre- 
senting numbers. The integer mode is used only to represent integers, and it will 
not concern us any further. The floating-point form is used to represent real 
numbers. The numbers allowed can be of greatly varying size, but there are 
limitations on both the magnitude of the number and on the number of digits. 
The floating-point representation is closely related to what is called scientific
notation in many high school mathematics texts. 

The number base used in computers is seldom decimal. Most digital computers
use the base 2 (binary) number system or some variant of it such as base 8
(octal) or base 16 (hexadecimal). 
Example (a) In base 2, the digits are 0 and 1. As an example of the conversion
of a base 2 number to decimal, we have

$$ (11011.01)_2 = 1 \cdot 2^4 + 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0 + 0 \cdot 2^{-1} + 1 \cdot 2^{-2} = 27.25 $$

When using numbers in some other base, we will often use $(x)_\beta$ to indicate that
the number $x$ is to be interpreted in the base $\beta$ number system.

(b) In base 16, the digits are 0, 1, ..., 9, A, ..., F. As an example of the
conversion of such a number to decimal, we have

$$ (56C.F)_{16} = 5 \cdot 16^2 + 6 \cdot 16^1 + 12 \cdot 16^0 + 15 \cdot 16^{-1} = 1388.9375 $$

The conversion of decimal numbers to binary is examined in the problems.
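
The two conversions above can be reproduced with a short routine. This is a minimal sketch, assuming digit strings with an optional radix point and bases up to 16; the name `to_decimal` is hypothetical, and the function does not check that each digit is valid for the given base.

```python
from fractions import Fraction

DIGITS = "0123456789ABCDEF"

def to_decimal(digits, base):
    """Convert a digit string such as '11011.01' in the given base to an exact Fraction."""
    whole, _, frac = digits.upper().partition(".")
    value = Fraction(0)
    for ch in whole:
        # Horner's rule for the integer part.
        value = value * base + DIGITS.index(ch)
    for k, ch in enumerate(frac, start=1):
        # Digit k after the radix point has weight base**(-k).
        value += Fraction(DIGITS.index(ch), base ** k)
    return value

print(float(to_decimal("11011.01", 2)))
print(float(to_decimal("56C.F", 16)))
```

Exact rational arithmetic is used so that the result is not itself affected by floating-point rounding.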


Let $\beta$ denote the number base being used in the computer. Then a nonzero
number $x$ in the computer is stored essentially in the form

$$ x = \sigma \cdot (.a_1 a_2 \cdots a_t)_\beta \cdot \beta^e \qquad (1.2.1) $$

with $\sigma = +1$ or $-1$, $0 \le a_i \le \beta - 1$, $e$ an integer, and

$$ (.a_1 a_2 \cdots a_t)_\beta = \frac{a_1}{\beta} + \frac{a_2}{\beta^2} + \cdots + \frac{a_t}{\beta^t} $$

The term $\sigma$ is called the sign, $e$ is called the exponent, and $(.a_1 \cdots a_t)_\beta$ is called
the mantissa of the floating-point number $x$. The number $\beta$ is also called the
radix, and the point preceding $a_1$ in (1.2.1) is called the radix point, for example,
decimal point ($\beta = 10$), binary point ($\beta = 2$). The integer $t$ gives the number of
base $\beta$ digits in the representation. We will always assume

$$ a_1 \ne 0 $$

giving what is called the normalized floating-point representation. We will also
assume that

$$ L \le e \le U \qquad (1.2.2) $$

which limits the possible size of $x$. The number $x = 0$ is always allowed,
requiring a special representation. Table 1.1 contains the values of $\beta$, $t$, $L$, and $U$
for a number of common computers. The use of $\beta$, $t$, $L$, and $U$ to specify the
arithmetic characteristics is based on that in [Forsythe et al. (1977), p. 11]. Some 
computers use a different placement of the radix point (e.g., CDC CYBER). We 
have modified their exponent bounds so that the limits on the size of a 
floating-point number will be correct when using the theory of this section. We 
also include results for double precision representations on some computers that 
include it in their hardware. In Table 1.1 there are additional columns that will 
be explained later. 
Table 1.1 Floating-point representations on various computers

Machine         S/D  R/C   β    t        L        U          δ         M
CDC CYBER 170    S    R    2   48     -976     1071   3.55E-15   2.81E14
CDC CYBER 205    S    C    2   47  -28,626   28,718   1.42E-14   1.41E14
CRAY-1           S    C    2   48    -8192     8191   7.11E-15   2.81E14
DEC VAX          S    R    2   24     -127      127    5.96E-8    1.68E7
DEC VAX          D    R    2   53    -1023     1023   1.11E-16   9.01E15
HP-11C, 15C      S    R   10   10      -99       99   5.00E-10   1.00E10
IBM 3033         S    C   16    6      -64       63    9.54E-7    1.68E7
IBM 3033         D    C   16   14      -64       63   2.22E-16   7.21E16
Intel 8087       S    R    2   24     -126      127    5.96E-8    1.68E7
Intel 8087       D    R    2   53    -1022     1023   1.11E-16   9.01E15
PRIME 850        S    R    2   23     -128      127    1.19E-7    8.39E6
PRIME 850        S    C    2   23     -128      127    1.19E-7    8.39E6
PRIME 850        D    C    2   47  -32,896   32,639   1.42E-14   1.41E14

Note: S/D: single or double precision;
R/C: rounding or chopping;
β: number base (radix);
t: digits in mantissa [see (1.2.1)];
L, U: exponent limits [see (1.2.2)];
δ: unit round [see (1.2.12)];
M: exact integers bound [see (1.2.16)].


Chopping and rounding Most real numbers $x$ cannot be represented exactly by
the floating-point representation previously given, and thus they must be
approximated by a nearby number representable in the machine, if possible. Given
an arbitrary number $x$, we let $fl(x)$ denote its machine approximation, if it exists.
There are two principal ways of producing $fl(x)$ from $x$: chopping and rounding.
Let a real number $x$ be written in the form

$$ x = \sigma \cdot (.a_1 a_2 \cdots a_t a_{t+1} \cdots)_\beta \cdot \beta^e \qquad (1.2.3) $$

with $a_1 \ne 0$, and assume $e$ satisfies (1.2.2). The chopped machine representation
of $x$ is given by

$$ fl(x) = \sigma \cdot (.a_1 \cdots a_t)_\beta \cdot \beta^e \qquad (1.2.4) $$

The reason for introducing chopping is that many computers use chopping rather
than rounding after each arithmetic operation.
The rounded representation of $x$ is given by

$$ fl(x) = \begin{cases} \sigma \cdot (.a_1 \cdots a_t)_\beta \cdot \beta^e, & 0 \le a_{t+1} < \dfrac{\beta}{2} \qquad (1.2.5) \\[2ex] \sigma \cdot [(.a_1 \cdots a_t)_\beta + (.0 \cdots 01)_\beta] \cdot \beta^e, & \dfrac{\beta}{2} \le a_{t+1} < \beta \qquad (1.2.6) \end{cases} $$

In the last formula, $(.0 \cdots 01)_\beta$ denotes $\beta^{-t}$. Although this definition of $fl(x)$
is somewhat formal, it yields the standard definition of rounding that most 
people have learned for decimal numbers. A variation of this definition is 
sometimes used in order to have unbiased rounding. In such a case, if 


$$ (1)\ a_{t+1} = \frac{\beta}{2} \qquad \text{and} \qquad (2)\ a_j = 0 \ \text{ for } j \ge t + 2 $$

then whether to round up or down is based on whether $a_t$ is odd or even,
respectively. This leads to the unbiased rounding rule that most people learn for
rounding decimal numbers; but we will henceforth assume the simpler definition
(1.2.5)-(1.2.6).
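
The chopping rule (1.2.4) and the rounding rule (1.2.5)-(1.2.6) can be sketched in exact rational arithmetic. The function name `fl` and its parameters are illustrative assumptions; the exponent limits (1.2.2) are ignored, and the rounding test assumes an even base β.

```python
from fractions import Fraction

def fl(x, beta=10, t=4, mode="round"):
    """Return the t-digit base-beta approximation of x as an exact Fraction.

    mode="chop" follows (1.2.4); mode="round" follows (1.2.5)-(1.2.6).
    """
    x = Fraction(x)
    if x == 0:
        return x
    sign = 1 if x > 0 else -1
    mag = abs(x)
    # Find e with beta**(e-1) <= mag < beta**e, so the mantissa lies in [1/beta, 1).
    e = 0
    while mag >= Fraction(beta) ** e:
        e += 1
    while mag < Fraction(beta) ** (e - 1):
        e -= 1
    scaled = mag / Fraction(beta) ** (e - t)         # mantissa shifted t digits left
    digits = scaled.numerator // scaled.denominator  # the integer a_1 a_2 ... a_t
    if mode == "round" and scaled - digits >= Fraction(1, 2):
        # For even beta, a_{t+1} >= beta/2 exactly when the discarded part is >= 1/2.
        digits += 1
    return sign * digits * Fraction(beta) ** (e - t)

x = Fraction("0.0123456")
print(float(fl(x, 10, 4, "chop")), float(fl(x, 10, 4, "round")))
```

With four decimal digits, the sample value chops to .01234 and rounds to .01235.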

With most real numbers x, we have fl(x) # x. Looking at the relative (or 
percentage) error, it can be shown that 


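$$ \frac{x - fl(x)}{x} = -\varepsilon \qquad (1.2.7) $$

with

$$ -\beta^{-t+1} \le \varepsilon \le 0 \qquad \text{chopped } fl(x) \qquad (1.2.8) $$

$$ -\tfrac{1}{2}\beta^{-t+1} \le \varepsilon \le \tfrac{1}{2}\beta^{-t+1} \qquad \text{rounded } fl(x) \qquad (1.2.9) $$

We will show the result (1.2.8) for chopping; the result (1.2.9) for rounding is left
as a problem.
Assume $\sigma = +1$, since the case $\sigma = -1$ will not change the sign of $\varepsilon$. From
(1.2.3) and (1.2.4), we have

$$ x - fl(x) = (.00 \cdots 0 a_{t+1} \cdots)_\beta \cdot \beta^e $$

Letting $\gamma = \beta - 1$,

$$ 0 \le x - fl(x) \le (.00 \cdots 0 \gamma\gamma \cdots)_\beta \cdot \beta^e = \gamma \left[ \beta^{-t-1} + \beta^{-t-2} + \cdots \right] \cdot \beta^e = \gamma \left[ \frac{\beta^{-t-1}}{1 - \beta^{-1}} \right] \cdot \beta^e = \beta^{-t+e} $$

$$ 0 \le \frac{x - fl(x)}{x} \le \frac{\beta^{-t+e}}{(.a_1 a_2 \cdots)_\beta \cdot \beta^e} \le \frac{\beta^{-t}}{(.100 \cdots)_\beta} = \beta^{-t+1} $$

This proves (1.2.8), and the proof of (1.2.9) is similar.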


The formula (1.2.7) is usually written in the equivalent form

$$ fl(x) = (1 + \varepsilon) x \qquad (1.2.10) $$

with $\varepsilon$ given by (1.2.8) or (1.2.9). Thus $fl(x)$ can be considered to be a small
relative perturbation of $x$. This formula for $fl(x)$ also allows us to deal precisely
with the effects of rounding/chopping errors in computer arithmetic operations.
Examples of this are given in later sections. The definition of $fl(x)$ and the use of
(1.2.10) is due to [Wilkinson (1963)], and it is widely used in analyzing the effects
of rounding errors.


Accuracy of floating-point representation We now introduce two measures that 
give a fairly precise idea of the possible accuracy in a floating-point representa- 
tion. The first of these is closely related to the preceding error result (1.2.7)-(1.2.9) 
for fi(x). 

The unit round of a computer is the number $\delta$ satisfying: (1) it is a positive
floating-point number, and (2) it is the smallest such number for which

$$ fl(1 + \delta) > 1 \qquad (1.2.11) $$

Thus for any floating-point number $\hat{\delta} < \delta$, we have $fl(1 + \hat{\delta}) = 1$, and $1 + \hat{\delta}$ and
1 are identical within the computer’s arithmetic. This gives a precise measure of 
how many digits of accuracy are possible in the representation of a number. Most 
high-quality portable computer programs use the unit round in order to note the 
maximal accuracy that is possible on the computer being used. 

The unit round $\delta$ is easily calculated, and it is given by

$$ \delta = \begin{cases} \beta^{-t+1} & \text{chopped definition of } fl(x) \\ \tfrac{1}{2}\beta^{-t+1} & \text{rounded definition of } fl(x) \end{cases} \qquad (1.2.12) $$

We show this for rounded arithmetic on a binary machine. First we must show

$$ fl(1 + 2^{-t}) > 1 \qquad (1.2.13) $$

Write

$$ 1 + 2^{-t} = [(.10 \cdots)_2 + (.00 \cdots 010 \cdots)_2] \cdot 2^1 = (.10 \cdots 010 \cdots)_2 \cdot 2^1 \qquad (1.2.14) $$

in which the second 1 is in position $t + 1$. Form $fl(1 + 2^{-t})$, and note that there
is a 1 in position $t + 1$ of the mantissa. Then from (1.2.6)

$$ fl(1 + 2^{-t}) = (.10 \cdots 01)_2 \cdot 2^1 = 1 + 2^{-t+1} $$

with the final 1 in position $t$. Thus (1.2.13) is satisfied, although

$$ fl(1 + 2^{-t}) \ne 1 + 2^{-t} $$

The fact that $\delta$ cannot be smaller than $2^{-t}$ follows easily by reexamining (1.2.14).
If $\hat{\delta} < \delta$, then $1 + \hat{\delta}$ has a 0 in position $t + 1$ of the mantissa in (1.2.14); and the
definition (1.2.5) of rounding would then imply $fl(1 + \hat{\delta}) = 1$.
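
A related quantity is easy to measure experimentally. The sketch below assumes Python floats are IEEE 754 double precision numbers (β = 2 and t = 53 in the notation of this section), which is the case on essentially all current platforms. It searches only over powers of two, and the function name is illustrative.

```python
import sys

def smallest_power_of_two_increment():
    """Smallest power of two p for which 1.0 + p > 1.0 in float arithmetic."""
    p = 1.0
    while 1.0 + p / 2 > 1.0:
        p /= 2
    return p

p = smallest_power_of_two_increment()
print(p, sys.float_info.epsilon)
print(1.0 + 2.0 ** -53 == 1.0)
```

On such a system the search stops at 2^-52, which is also the value reported by `sys.float_info.epsilon`. The value 2^-53 predicted by the rounded case of (1.2.12) is not reached because IEEE arithmetic breaks ties toward an even last digit rather than always rounding up as in (1.2.6), so 1 + 2^-53 rounds back to 1.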


A second measure of the maximal accuracy possible in a floating-point 
representation is to find the largest integer M for which 


$$ m \text{ an integer and } 0 \le m \le M \ \Longrightarrow\ fl(m) = m \qquad (1.2.15) $$

This also implies $fl(M + 1) \ne M + 1$. It is left as a problem to show that

$$ M = \beta^t \qquad (1.2.16) $$

The numbers $M$ and $\delta$ for various computers are given in Table 1.1, along with
whether the computers round (R) or chop (C).
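
The bound (1.2.16) can be observed directly for Python floats, again assuming they are IEEE 754 double precision numbers with β = 2 and t = 53, so that M = 2^53. The helper name `is_exact` is an assumption for illustration.

```python
def is_exact(m):
    """True if the integer m survives a round trip through a Python float."""
    return int(float(m)) == m

M = 2 ** 53
results = {m: is_exact(m) for m in (M - 1, M, M + 1, M + 2)}
print(results)
```

The integers up to M are stored exactly, M + 1 is not, and some larger integers such as M + 2 are again representable because they need no more than 53 significant binary digits.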


Example For PRIME computers in single precision,

$$ M = 2^{23} = 8388608 $$

Thus all integers with six decimal digits and most with seven decimal digits can
be stored exactly in the single precision floating-point representation. For the
unit round,

$$ \delta = 2^{-22} \doteq 2.38 \times 10^{-7} \qquad \text{chopped arithmetic} $$
$$ \delta = 2^{-23} \doteq 1.19 \times 10^{-7} \qquad \text{rounded arithmetic} $$


Users of the PRIME have both chopped and rounded arithmetic available to 
them, for single precision arithmetic. 

In almost all cases, rounded arithmetic is greatly preferable to chopped 
arithmetic. This is examined in more detail in later sections, but the main reason 
lies in the biased sign of $\varepsilon$ in (1.2.8) as compared to the lack of such bias in
(1.2.9). 


Underflow and overflow When the exponent bounds of (1.2.2) are violated, then 
the associated number x of (1.2.1) cannot be represented in the computer. We 
now look at what this says about the possible range in magnitude of x. 

The smallest positive floating-point number is

$$ x_L = (.10 \cdots 0)_\beta \cdot \beta^L = \beta^{L-1} $$

Using $\gamma = \beta - 1$, the largest positive floating-point number is

$$ x_U = (.\gamma\gamma \cdots \gamma)_\beta \cdot \beta^U = (1 - \beta^{-t}) \cdot \beta^U $$

Thus all floating-point numbers $x$ must satisfy

$$ x_L \le |x| \le x_U \qquad (1.2.17) $$

Within the Fortran language of most computers, if an arithmetic operation leads
to a result $x$ for which $|x| > x_U$, then this will be a fatal error and the program
will terminate. This is called an overflow error. In contrast, if

$$ 0 < |x| < x_L $$

then usually $fl(x)$ will be set to zero and computation will continue. This is called
an underflow error.
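
Other languages handle these events differently from the Fortran behavior just described. As a small illustration, the following sketch shows what CPython does with IEEE 754 double precision floats (an assumption about the platform): a product that underflows is set to zero, a product that overflows becomes the special value `inf`, and the power operator raises an `OverflowError`.

```python
import math
import sys

tiny = sys.float_info.min   # smallest positive normalized float, about 2.2e-308
huge = sys.float_info.max   # largest finite float, about 1.8e308

underflowed = tiny * tiny   # true value is far below the representable range
overflowed = huge * 2.0     # true value is above the largest finite float

try:
    huge ** 2.0
    power_raised = False
except OverflowError:
    power_raised = True

print(underflowed, overflowed, power_raised)
```

IEEE arithmetic also allows unnormalized numbers below `sys.float_info.min`, so underflow there is gradual rather than abrupt; that refinement is outside the model of (1.2.17).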


Example Consider evaluating $x = s^{10}$ on an IBM mainframe computer. Then
there will be an underflow error if

$$ s^{10} < 16^{-65} $$

Thus $x = s^{10}$ will be set to zero if $|s| < 1.49 \times 10^{-8}$. Also, there will be an
overflow error if

$$ s^{10} > 16^{63} $$

or equivalently,

$$ |s| > 3.9 \times 10^{7} $$
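
These two thresholds follow from the formulas for the smallest and largest floating-point numbers, using the IBM single precision values β = 16, t = 6, L = -64, U = 63 from Table 1.1. The sketch below recomputes them; the variable names are illustrative.

```python
beta, t, L, U = 16, 6, -64, 63          # IBM single precision values from Table 1.1

x_L = float(beta) ** (L - 1)            # smallest positive number, beta**(L-1)
x_U = (1 - float(beta) ** -t) * float(beta) ** U   # largest number, (1 - beta**-t) * beta**U

s_underflow = x_L ** 0.1                # s**10 underflows when |s| is below this
s_overflow = x_U ** 0.1                 # s**10 overflows when |s| is above this

print(s_underflow, s_overflow)
```

The printed values should be close to 1.49e-8 and 3.85e7, the second of which the text rounds to 3.9e7.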


1.3. Definitions and Sources of Error 


We now give a rough classification of the major ways in which error is introduced 
into the solution of a problem, including some that fall outside the usual scope of 
mathematics. We begin with a few simple definitions about error. 

In solving a problem, we seek an exact or true solution, which we denote by 
$x_T$. Approximations are usually made in solving the problem, resulting in an
approximate solution $x_A$. We define the error in $x_A$ by

$$ \text{Error}(x_A) = \text{error in } x_A = x_T - x_A $$

For many purposes, we prefer to study the percentage or relative error in $x_A$,

$$ \text{Rel}(x_A) = \text{relative error in } x_A = \frac{x_T - x_A}{x_T} $$

provided $x_T \ne 0$. This has already been referred to in (1.2.7), in measuring the
error in $fl(x)$.


Example $x_T = e = 2.7182818\ldots$, $x_A = \frac{19}{7} = 2.7142857\ldots$

$$ \text{Error}(x_A) = .003996\ldots \qquad \text{Rel}(x_A) = .00147\ldots $$



In place of relative error, we often use the concept of significant digits. We say
$x_A$ has $m$ significant (decimal) digits with respect to $x_T$ if the error $x_T - x_A$ has
magnitude less than or equal to 5 in the $(m + 1)$st digit of $x_T$, counting to the
right from the first nonzero digit in $x_T$.


Example (a) $x_T = \frac{1}{3}$, $x_A = .333$, $|x_T - x_A| \doteq .00033$

Since the error is less than 5 in the fourth digit to the right of the first nonzero
digit in $x_T$, we say that $x_A$ has three significant digits with respect to $x_T$.

(b) $x_T = 23.496$, $x_A = 23.494$, $|x_T - x_A| = .002$

The term $x_A$ has four significant digits with respect to $x_T$, since the error is less
than 5 in the fifth place to the right of the first nonzero digit in $x_T$. Note that if
$x_A$ is rounded to four places, an additional error is introduced and $x_A$ will no
longer have four significant digits.

(c) $x_T = .02138$, $x_A = .02144$, $|x_T - x_A| = .00006$

The number $x_A$ has two significant digits, but not three, with respect to $x_T$.


The following is sometimes used in measuring significant digits. If

$$\left| \frac{x_T - x_A}{x_T} \right| \le 5 \times 10^{-m-1} \tag{1.3.1}$$

then $x_A$ has m significant digits with respect to $x_T$. To show this, consider the
case $.1 \le |x_T| < 1$. Then (1.3.1) implies

$$|x_T - x_A| \le 5 \times 10^{-m-1} |x_T| < 5 \times 10^{-m-1}$$

Since $.1 \le |x_T| < 1$, this implies $x_A$ has m significant digits. The proof for a
general $x_T$ is essentially the same, using $x_T = \hat{x}_T \cdot 10^e$, with $.1 \le |\hat{x}_T| < 1$, e an
integer. Note that (1.3.1) is a sufficient condition, but not a necessary condition,
in order that $x_A$ have m significant digits. Examples (a) and (b) just given have
one more significant digit than that indicated by the test (1.3.1).
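
As an illustration of the test (1.3.1), the sketch below returns the largest m that the test guarantees. The helper name `digits_by_test` is hypothetical, and the result is only the sufficient-condition estimate, so it may be one less than the true number of significant digits.

```python
import math

def digits_by_test(x_true: float, x_approx: float) -> int:
    """Largest m with |Rel(x_A)| <= 5 * 10**(-m-1), i.e. the test (1.3.1)."""
    r = abs((x_true - x_approx) / x_true)
    if r == 0:
        raise ValueError("x_A equals x_T; every m passes the test")
    # r <= 5 * 10**(-m-1)  is equivalent to  m <= -log10(r / 5) - 1
    return math.floor(-math.log10(r / 5)) - 1

# Examples (a), (b), (c) from the text.
print(digits_by_test(1 / 3, 0.333))        # test gives 2, actual count is 3
print(digits_by_test(23.496, 23.494))      # test gives 3, actual count is 4
print(digits_by_test(0.02138, 0.02144))    # test gives 2, actual count is 2
```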


Sources of error We now give a rough classification of the major sources of
error.


(S1) Mathematical modeling of a physical problem A mathematical model
for a physical situation is an attempt to give mathematical relationships between 
certain quantities of physical interest. Because of the complexity of physical 
reality, a variety of simplifying assumptions are used to construct a more 
tractable mathematical model. The resulting model has limitations on its accu- 
racy as a consequence of these assumptions, and these limitations may or may 
not be troublesome, depending on the uses of the model. In the case that the
model is not sufficiently accurate, the numerical solution of the model cannot
improve upon this basic lack of accuracy.


Example Consider a projectile of mass m to have been fired into the air, its
flight path always remaining close to the earth’s surface. Let an xyz coordinate
system be introduced with origin on the earth’s surface and with the positive
z-axis perpendicular to the earth and directed upward. Let the position of the
projectile at time t be denoted by r(t) = x(t)i + y(t)j + z(t)k, using the
standard vector field theory notation. One model for the flight of the projectile is
given by Newton’s second law as

$$m \frac{d^2 r(t)}{dt^2} = -mg\,k - b\,\frac{dr(t)}{dt} \tag{1.3.2}$$

where b > 0 is a constant and g is the acceleration due to gravity. This equation
says that the only forces acting on the projectile are (1) the gravitational force of
the earth, and (2) a frictional force that is directly proportional to the speed
|v(t)| = |dr(t)/dt| and directed opposite to the path of flight.

In some situations this is an excellent model, and it may not be necessary to
include even the frictional term. But the model doesn’t include forces of
resistance acting perpendicular to the plane of flight, for example, a cross-wind, and it
doesn’t allow for the Coriolis effect. Also, the frictional force in (1.3.2) may be
proportional to $|v(t)|^\alpha$ with $\alpha \neq 1$.

If a model is adequate for physical purposes, then we wish to use a numerical 
scheme that will preserve this accuracy. But if the model is inadequate, then the 
numerical analysis cannot improve the accuracy except by chance. On the other 
hand, it is not a good idea to create a model that is more complicated than 
needed, introducing terms that are relatively insignificant with respect to the 
phenomenon being studied. A more complicated model can often introduce 
additional numerical analysis difficulties, without yielding any significantly greater 
accuracy. For books concerned explicitly with mathematical modeling in the 
sciences, see Bender (1978), Lin and Segal (1974), Maki and Thompson (1973), 
Rubinow (1975). 


(S2) Blunders In precomputer times, chance arithmetic errors were always a 
serious problem. Check schemes, some quite elaborate, were devised to detect if 
such errors had occurred and to correct for them before the calculations had 
proceeded very far past the error. For an example, see Fadeeva (1959) for check 
schemes used when solving systems of linear equations. 

With the introduction of digital computers, the type of blunder has changed. 
Chance arithmetic errors (e.g., computer malfunctioning) are now relatively rare, 
and programming errors are currently the main difficulty. Often a program error 
will be repeated many times in the course of executing the program, and its 
existence becomes obvious because of absurd numerical output (although the 
source of the error may still be difficult to find). But as computer programs 
become more complex and lengthy, the existence of a small program error may 
be hard to detect and correct, even though the error may make a subtle, but
crucial difference in the numerical results. This makes good program debugging
very important, even though it may not seem very rewarding immediately.

To detect programming errors, it is important to have some way of checking 
the accuracy of the program output. When first running the program, you should 
use cases for which you know the correct answer, if possible. With a complex 
program, break it into smaller subprograms, each of which can be tested 
separately. When the entire program has been checked out and you believe it to 
be correct, maintain a watchful eye as to whether the output is reasonable or not. 


(S3) Uncertainty in physical data Most data from a physical experiment will 
contain error or uncertainty within it. This must affect the accuracy of any 
calculations based on the data, limiting the accuracy of the answers. The 
techniques for analyzing the effects in other calculations of this error are much 
the same as those used in analyzing the effects of rounding error, although the 
error in data is usually much larger than rounding errors. The material of the 
next sections discusses this further. 


(S4) Machine errors By machine errors we mean the errors inherent in using
the floating-point representation of numbers. Specifically, we mean the
rounding/chopping errors and the underflow/overflow errors. The rounding/chopping
errors are due to the finite length of the floating-point mantissa; and these errors
occur with all of the computer arithmetic operations. All of these forms of
machine error were discussed in Section 1.2; in the following sections, we
consider some of their consequences. Also, for notational simplicity, we
henceforth let the term rounding error include chopping where applicable.


(S5) Mathematical truncation error This name refers to the error of ap-
proximation in numerically solving a mathematical problem, and it is the error 
generally associated with the subject of numerical analysis. It involves the 
approximation of infinite processes by finite ones, replacing noncomputable 
problems with computable ones. We use some examples to make the idea more 
precise. 


Example  (a)  Using the first two terms of the Taylor series from (1.1.7),

$$\sqrt{1 + x} \doteq 1 + \tfrac{1}{2}x \tag{1.3.3}$$

which is a good approximation when x is small. See Chapter 4 for the general
area of approximation of functions.

(b)  For evaluating an integral on [0, 1], use

$$\int_0^1 f(x)\,dx \doteq \frac{1}{n} \sum_{j=1}^{n} f\!\left(\frac{2j - 1}{2n}\right) \tag{1.3.4}$$

This is called the midpoint numerical integration rule: see the last part of Section
5.2 for more detail. The general topic of numerical integration is examined in
Chapter 5.

(c)  For the differential equation problem

$$Y'(t) = f(t, Y(t)) \qquad Y(t_0) = Y_0 \tag{1.3.5}$$

use the approximation of the derivative

$$Y'(t) \doteq \frac{Y(t + h) - Y(t)}{h}$$

for some small h. Let $t_j = t_0 + jh$ for $j \ge 0$, and define an approximate solution
function $y(t_j)$ by

$$\frac{y(t_{j+1}) - y(t_j)}{h} = f(t_j, y(t_j))$$

Thus we have

$$y(t_{j+1}) = y(t_j) + h f(t_j, y(t_j)) \qquad j \ge 0 \qquad y(t_0) = Y_0$$

This is Euler’s method for solving an initial value problem for an ordinary
differential equation. An extensive discussion and analysis of it is given in Section
6.2. Chapter 6 gives a complete development of numerical methods for solving
the initial value problem (1.3.5).
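
The approximations in (b) and (c) translate directly into code. The following Python sketch is illustrative only: the function names and the two test problems (the integral of $e^x$ on [0, 1] and $Y' = -Y$, $Y(0) = 1$) are assumptions made here because their exact answers are known, so the truncation error can be displayed.

```python
import math

def midpoint_rule(f, n):
    """(1/n) * sum_{j=1..n} f((2j - 1) / (2n)): the rule (1.3.4) on [0, 1]."""
    return sum(f((2 * j - 1) / (2 * n)) for j in range(1, n + 1)) / n

def euler(f, t0, y0, h, steps):
    """Return [y(t_0), ..., y(t_steps)] from y(t_{j+1}) = y(t_j) + h f(t_j, y(t_j))."""
    ys = [y0]
    for j in range(steps):
        ys.append(ys[-1] + h * f(t0 + j * h, ys[-1]))
    return ys

# Truncation error of the midpoint rule for the integral of e^x over [0, 1].
print(midpoint_rule(math.exp, 100) - (math.e - 1))

# Truncation error of Euler's method at t = 1 for Y' = -Y, Y(0) = 1.
print(euler(lambda t, y: -y, 0.0, 1.0, 0.01, 100)[-1] - math.exp(-1))
```

Both printed errors are dominated by mathematical truncation error rather than by rounding; reducing h or increasing n makes them smaller.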


Most numerical analysis problems in the following chapters involve mainly 
mathematical truncation errors. The major exception is the solution of systems of 
linear equations in which rounding errors are often the major source of error. 


Noise in function evaluation One of the immediate consequences of rounding
errors is that the evaluation of a function f(x) using a computer will lead to an
approximate function $\hat{f}(x)$ that is not continuous, although this is apparent only
when the graph of $\hat{f}(x)$ is looked at on a sufficiently small scale. After each
arithmetic operation that is used in evaluating f(x), there will usually be a
rounding error. When the effect of these rounding errors is considered, we obtain
a computed value $\hat{f}(x)$ whose error $f(x) - \hat{f}(x)$ appears to be a small random
number as x varies. This error in $\hat{f}(x)$ is called noise. When the graph of $\hat{f}(x)$ is
looked at on a small enough scale, it appears as a fuzzy band of dots, where the x
values range over all acceptable floating-point numbers on the machine. This has
consequences for many other programs that make use of $\hat{f}(x)$. For example,
calculating the root of f(x) by using $\hat{f}(x)$ will lead to uncertainty in the location
of the root, because it will likely be located in the intersection of the x-axis and
the fuzzy banded graph of $\hat{f}(x)$. The following example shows that this can result
in considerable uncertainty in the location of the root.


Example Let

$$f(x) = x^3 - 3x^2 + 3x - 1 \tag{1.3.6}$$

which is just $(x - 1)^3$. We evaluated (1.3.6) in the single precision BASIC of a
popular microcomputer, using rounded binary arithmetic with a unit round of
$\delta = 2^{-24} \doteq 5.96 \times 10^{-8}$. The graph of $\hat{f}(x)$ on [0, 2], shown in Figure 1.1, is
continuous and smooth to the eye, as would be expected. But the graph on the
smaller interval [.998, 1.002] shows the discontinuous nature of $\hat{f}(x)$, as is
apparent from the graph in Figure 1.2. In this latter case, $\hat{f}(x)$ was evaluated at
640 evenly spaced values of x in [.998, 1.002], resulting in the fuzzy band that is
the graph of $\hat{f}(x)$. From the latter graph, it can be seen that there is a large
interval of uncertainty as to where $\hat{f}(x)$ crosses the x-axis. We return to this topic
in Section 2.7 of Chapter 2.

[Figure 1.1  Graph of (1.3.6).]

[Figure 1.2  Detailed graph of (1.3.6); vertical axis labels 9.8E-9 and -8.2E-9.]
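
The noise can be reproduced qualitatively in Python. The sketch below is an assumption-laden emulation, not the BASIC computation of the text: it rounds to IEEE single precision after every operation (via the standard `struct` module) and uses one particular order of operations, so the size of the band need not match Figure 1.2.

```python
import struct

def f32(v: float) -> float:
    """Round a Python float (a double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def f_noisy(x: float) -> float:
    """Evaluate x^3 - 3x^2 + 3x - 1, rounding to single precision after each operation."""
    x = f32(x)
    x2 = f32(x * x)
    x3 = f32(x2 * x)
    t = f32(x3 - f32(3.0 * x2))
    t = f32(t + f32(3.0 * x))
    return f32(t - 1.0)

# 640 evenly spaced points in [.998, 1.002], as in the text.
xs = [0.998 + 0.004 * i / 639 for i in range(640)]
vals = [f_noisy(x) for x in xs]
sign_changes = sum(1 for a, b in zip(vals, vals[1:]) if a * b < 0)
print(min(vals), max(vals), sign_changes)
```

In this emulation the computed values fall on a coarse grid of multiples of $2^{-24}$, while the exact $(x - 1)^3$ stays below $10^{-8}$ in magnitude on this interval, so the output is dominated by noise.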


Underflow / overflow errors in calculations We consider another consequence of
machine errors. The upper and lower limits for floating-point numbers, given in
(1.2.17), can lead to errors in calculations. Sometimes these are unavoidable, but
often they are an artifact of the way the calculation is arranged.

To illustrate this, consider evaluating the magnitude of a complex number,

$$|x + iy| = \sqrt{x^2 + y^2} \tag{1.3.7}$$

It is possible this may underflow or overflow, even though the magnitude |x + iy|
is within machine limits. For example, if $x_U = 1.7 \times 10^{38}$ from (1.2.17), then
(1.3.7) will overflow for $x = y = 10^{20}$, even though $|x + iy| \doteq 1.4 \times 10^{20}$. To
avoid this, determine the larger in magnitude of x and y, say x. Then rewrite (1.3.7) as

$$|x + iy| = |x| \cdot \sqrt{1 + \alpha^2} \qquad \alpha = \frac{y}{x} \tag{1.3.8}$$

We must calculate $\sqrt{1 + \alpha^2}$, with $0 \le |\alpha| \le 1$. This avoids the problems of (1.3.7),
both for underflow and overflow.
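
A minimal Python sketch of (1.3.7) versus (1.3.8) follows. It uses double precision, so the overflow threshold is about $1.8 \times 10^{308}$ rather than the $1.7 \times 10^{38}$ of the text, and the test values $10^{\pm 200}$ are chosen here accordingly; the function names are illustrative.

```python
import math

def naive_abs(x: float, y: float) -> float:
    """Direct use of (1.3.7); x*x or y*y may overflow or underflow."""
    return math.sqrt(x * x + y * y)

def scaled_abs(x: float, y: float) -> float:
    """Rearranged form (1.3.8): factor out the component of larger magnitude."""
    big, small = max(abs(x), abs(y)), min(abs(x), abs(y))
    if big == 0.0:
        return 0.0
    a = small / big              # 0 <= a <= 1
    return big * math.sqrt(1.0 + a * a)

print(naive_abs(1e200, 1e200), scaled_abs(1e200, 1e200))      # overflow vs. finite
print(naive_abs(1e-200, 1e-200), scaled_abs(1e-200, 1e-200))  # underflow vs. nonzero
```

Python's standard library also provides `math.hypot`, which serves the same purpose.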


1.4 Propagation of Errors 


In this and the next sections, we consider the effect of calculations with numbers
that are in error. To begin, consider the basic arithmetic operations. Let $\omega$ denote
one of the arithmetic operations $+, -, \times, /$; and let $\hat{\omega}$ be the computer version
of the same operation, which will usually include rounding. Let $x_A$ and $y_A$ be the
numbers being used for calculations, and suppose they are in error, with true
values

$$x_T = x_A + \epsilon \qquad y_T = y_A + \eta$$

Then $x_A \hat{\omega} y_A$ is the number actually computed, and for its error,

$$x_T \omega y_T - x_A \hat{\omega} y_A = [x_T \omega y_T - x_A \omega y_A] + [x_A \omega y_A - x_A \hat{\omega} y_A] \tag{1.4.1}$$

The first quantity in brackets is called the propagated error, and the second
quantity is normally rounding or chopping error. For this second quantity, we
usually have

$$x_A \hat{\omega} y_A = \text{fl}(x_A \omega y_A) \tag{1.4.2}$$

which means that $x_A \omega y_A$ is computed exactly and then rounded. Combining
(1.2.9) and (1.4.2),

$$|x_A \omega y_A - x_A \hat{\omega} y_A| \le \frac{\beta}{2} |x_A \omega y_A| \beta^{-t} \tag{1.4.3}$$

provided true rounding is used.
For the propagated error, we examine particular cases.

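Case (a) Multiplication. For the error in $x_A y_A$,

$$x_T y_T - x_A y_A = x_T y_T - (x_T - \epsilon)(y_T - \eta) = x_T \eta + y_T \epsilon - \epsilon\eta$$

$$\text{Rel}(x_A y_A) = \frac{x_T y_T - x_A y_A}{x_T y_T} = \frac{\eta}{y_T} + \frac{\epsilon}{x_T} - \frac{\epsilon}{x_T} \cdot \frac{\eta}{y_T}
= \text{Rel}(x_A) + \text{Rel}(y_A) - \text{Rel}(x_A) \cdot \text{Rel}(y_A)$$

For $|\text{Rel}(x_A)|, |\text{Rel}(y_A)| \ll 1$,

$$\text{Rel}(x_A y_A) \doteq \text{Rel}(x_A) + \text{Rel}(y_A) \tag{1.4.4}$$

The symbol “$\ll$” means “much less than.”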


Case (b) Division. By a similar argument,

$$\text{Rel}\!\left(\frac{x_A}{y_A}\right) = \frac{\text{Rel}(x_A) - \text{Rel}(y_A)}{1 - \text{Rel}(y_A)} \tag{1.4.5}$$

For $|\text{Rel}(y_A)| \ll 1$,

$$\text{Rel}\!\left(\frac{x_A}{y_A}\right) \doteq \text{Rel}(x_A) - \text{Rel}(y_A) \tag{1.4.6}$$

For both multiplication and division, relative errors do not propagate rapidly.

Case (c) Addition and subtraction.

$$(x_T \pm y_T) - (x_A \pm y_A) = (x_T - x_A) \pm (y_T - y_A) = \epsilon \pm \eta$$

$$\text{Error}(x_A \pm y_A) = \text{Error}(x_A) \pm \text{Error}(y_A) \tag{1.4.7}$$

This appears quite good and reasonable, but it can be misleading. The relative
error in $x_A \pm y_A$ can be quite poor when compared with $\text{Rel}(x_A)$ and $\text{Rel}(y_A)$.


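Example Let $x_T = \pi$, $x_A = 3.1416$, $y_T = 22/7$, $y_A = 3.1429$. Then

$$x_T - x_A \doteq -7.35 \times 10^{-6} \qquad \text{Rel}(x_A) \doteq -2.34 \times 10^{-6}$$
$$y_T - y_A \doteq -4.29 \times 10^{-5} \qquad \text{Rel}(y_A) \doteq -1.36 \times 10^{-5}$$
$$(x_T - y_T) - (x_A - y_A) \doteq -.0012645 - (-.0013) \doteq 3.55 \times 10^{-5}$$
$$\text{Rel}(x_A - y_A) \doteq -.028$$

Although the error in $x_A - y_A$ is quite small, the relative error in $x_A - y_A$ is
much larger than that in $x_A$ or $y_A$ alone.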
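
The figures in this example are easy to reproduce. The short Python sketch below (the helper `rel` is named here for illustration) recomputes the three relative errors in double precision.

```python
import math

def rel(true_value: float, approx_value: float) -> float:
    """Relative error (x_T - x_A) / x_T."""
    return (true_value - approx_value) / true_value

x_T, x_A = math.pi, 3.1416
y_T, y_A = 22 / 7, 3.1429

rel_x = rel(x_T, x_A)
rel_y = rel(y_T, y_A)
rel_diff = rel(x_T - y_T, x_A - y_A)   # relative error of the difference
print(rel_x, rel_y, rel_diff)
```

The relative error of the difference is several thousand times larger in magnitude than that of either operand, which is the loss of accuracy discussed next.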


Loss of significance errors This last example shows that it is possible to have a
large decrease in accuracy, in a relative error sense, when subtracting nearly equal
quantities. This can be a very important way in which accuracy is lost when error
is propagated in a calculation. We now give some examples of this phenomenon,
along with suggestions on how it may be avoided in some cases.


Example Consider solving $ax^2 + bx + c = 0$ when 4ac is small compared to
$b^2$; use the standard quadratic formula for the roots

$$r_T^{(1)} = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \qquad r_T^{(2)} = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \tag{1.4.8}$$

For definiteness, consider $x^2 - 26x + 1 = 0$. The formulas (1.4.8) yield

$$r_T^{(1)} = 13 + \sqrt{168} \qquad r_T^{(2)} = 13 - \sqrt{168} \tag{1.4.9}$$

Now imagine using a five-digit decimal machine. On it, $\sqrt{168} \doteq 12.961$. Then
define

$$r_A^{(1)} = 13 + 12.961 = 25.961 \qquad r_A^{(2)} = 13 - 12.961 = .039 \tag{1.4.10}$$

Using the exact answers,

$$\text{Rel}(r_A^{(1)}) \doteq 1.85 \times 10^{-5} \qquad \text{Rel}(r_A^{(2)}) \doteq -1.25 \times 10^{-2} \tag{1.4.11}$$

For the data entering into the calculation (1.4.10), using the notation of (1.4.7),

$$x_T = x_A = 13 \qquad y_T = \sqrt{168} \qquad y_A = 12.961$$
$$\text{Rel}(x_A) = 0 \qquad \text{Rel}(y_A) \doteq 3.71 \times 10^{-5}$$

The accuracy in $r_A^{(2)}$ is much less than that of the data $x_A$ and $y_A$ entering into
the calculation. We say that significant digits have been lost in the subtraction
$r_A^{(2)} = x_A - y_A$, or that we have had a loss-of-significance error in calculating $r_A^{(2)}$.
In $r_A^{(1)}$, we have five significant digits of accuracy, whereas we have only two
significant digits in $r_A^{(2)}$.
To cure this particular problem, of accurately calculating $r_A^{(2)}$, convert (1.4.9)
to

$$r_T^{(2)} = \frac{13 - \sqrt{168}}{1} \cdot \frac{13 + \sqrt{168}}{13 + \sqrt{168}} = \frac{1}{13 + \sqrt{168}}$$

Then use

$$\frac{1}{13 + \sqrt{168}} \doteq \frac{1}{25.961} \doteq .038519 \equiv r_A^{(2)} \tag{1.4.12}$$

There are two errors here, that of $\sqrt{168} \doteq 12.961$ and that of the final division.
But each of these will have small relative errors [see (1.4.6)], and the new value of
$r_A^{(2)}$ will be more accurate than the preceding one. By exact calculations, we now
have

$$\text{Rel}(r_A^{(2)}) \doteq -1.03 \times 10^{-5}$$

much better than in (1.4.11).

This new computation of $r_A^{(2)}$ demonstrates that the loss-of-significance error
is due to the form of the calculation, not to errors in the data of the computation.
In this example it was easy to find an alternative computation that eliminated the
loss-of-significance error, but this is not always possible. For a complete
discussion of the practical computation of roots of a quadratic polynomial, see
Forsythe (1969).
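
The five-digit decimal machine can be imitated with Python's standard `decimal` module. The sketch below is an illustration under that assumption: a context with `prec = 5` rounds each operation to five significant digits, which is taken here to model the machine of the example.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 5                                   # five significant decimal digits
    root = Decimal(168).sqrt()                     # 12.961
    r2_naive = Decimal(13) - root                  # subtracts nearly equal numbers
    r2_stable = Decimal(1) / (Decimal(13) + root)  # rearranged form (1.4.12)

# Reference value computed in the default 28-digit context.
r2_true = 13 - Decimal(168).sqrt()
rel_naive = float((r2_true - r2_naive) / r2_true)
rel_stable = float((r2_true - r2_stable) / r2_true)
print(r2_naive, r2_stable, rel_naive, rel_stable)
```

The two computed roots are .039 and .038519, with relative errors matching (1.4.11) and the improved value quoted after (1.4.12).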


Example With many loss-of-significance calculations, Taylor polynomial
approximations can be used to eliminate the difficulty. We illustrate this with the
evaluation of

$$f(x) = \int_0^1 e^{xt}\,dt = \frac{e^x - 1}{x} \qquad x \neq 0 \tag{1.4.13}$$

For x = 0, f(0) = 1; and easily, f(x) is continuous at x = 0.

To see that there is a loss-of-significance problem when x is small, we evaluate
f(x) at $x = 1.4 \times 10^{-9}$, using a popular and well-designed ten-digit hand
calculator. The results are

$$e^x = 1.000000001 \qquad \frac{e^x - 1}{x} = \frac{10^{-9}}{1.4 \times 10^{-9}} = .714 \tag{1.4.14}$$

The right-hand sides give the calculator results, and the true answer, rounded to
10 places, is

$$f(x) = 1.000000001$$

The calculation (1.4.14) has had a cancellation of the leading nine digits of
accuracy in the operands in the numerator.

To avoid the loss-of-significance error, use a quadratic Taylor approximation
to $e^x$ and then simplify f(x):

$$e^x = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} e^{\xi} \qquad 0 \le \xi \le x$$

$$f(x) = 1 + \frac{x}{2} + \frac{x^2}{6} e^{\xi} \tag{1.4.15}$$

With the preceding $x = 1.4 \times 10^{-9}$,

$$f(x) \doteq 1 + 7 \times 10^{-10}$$

with an error of less than $10^{-18}$.
In general, use (1.4.15) on some interval $[0, \delta]$, picking $\delta$ to ensure the error in

$$f(x) \doteq 1 + \frac{x}{2}$$

is sufficiently small. Of course, a higher degree approximation to $e^x$ could be
used, allowing a yet larger value of $\delta$.
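
The same cancellation occurs in double precision, only at a smaller scale. The following Python sketch compares the direct formula with the approximation $1 + x/2$ from (1.4.15); the reference value uses `math.expm1`, a standard-library function for $e^x - 1$ that avoids the cancellation and is used here as an assumption about what counts as "accurate".

```python
import math

def f_naive(x: float) -> float:
    """Direct evaluation of (e^x - 1)/x; the numerator cancels for small x."""
    return (math.exp(x) - 1.0) / x

def f_taylor(x: float) -> float:
    """The approximation 1 + x/2 from (1.4.15); its error is about x^2/6."""
    return 1.0 + x / 2.0

def f_reference(x: float) -> float:
    """Reference value using math.expm1."""
    return math.expm1(x) / x

x = 1.4e-9
print(f_naive(x), f_taylor(x), f_reference(x))
```

At this x the direct formula is wrong from about the eighth significant digit on, while the Taylor form agrees with the reference to full double precision.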


In general, Taylor approximations are often useful in avoiding loss-of-signifi- 
cance calculations. But in some cases, the loss-of-significance error is more subtle. 


Example Consider calculating a sum

$$S = \sum_{j=1}^{n} x_j \tag{1.4.16}$$

with positive and negative terms $x_j$, each of which is an approximate value.
Furthermore, assume the sum S is much smaller than the maximum magnitude
of the $x_j$. In calculating such a sum on a computer, it is likely that a
loss-of-significance error will occur. We give an illustration of this.

Consider using the Taylor formula (1.1.4) for $e^x$ to evaluate $e^{-5}$:

$$e^{-5} \doteq 1 + \frac{(-5)}{1!} + \frac{(-5)^2}{2!} + \frac{(-5)^3}{3!} + \cdots + \frac{(-5)^n}{n!} \tag{1.4.17}$$

Imagine using a computer with four-digit decimal rounded floating-point
arithmetic, so that each of the terms in this series must be rounded to four significant
digits. In Table 1.2, we give these rounded terms $x_j$, along with the exact sum of
the terms through the given degree. The true value of $e^{-5}$ is .006738, to four
significant digits, and this is quite different from the final sum in the table. Also,
if (1.4.17) is calculated exactly for n = 25, then the correct value of $e^{-5}$ is
obtained to four digits.

Table 1.2  Calculation of (1.4.17) using four-digit decimal arithmetic

Degree   Term          Sum           Degree   Term          Sum
   0      1.000         1.000           13    -.1960        -.04230
   1     -5.000        -4.000           14     .7001E-1      .02771
   2     12.50          8.500           15    -.2334E-1      .004370
   3    -20.83        -12.33            16     .7293E-2      .01166
   4     26.04         13.71            17    -.2145E-2      .009518
   5    -26.04        -12.33            18     .5958E-3      .01011
   6     21.70          9.370           19    -.1568E-3      .009957
   7    -15.50         -6.130           20     .3920E-4      .009996
   8      9.688         3.558           21    -.9333E-5      .009987
   9     -5.382        -1.824           22     .2121E-5      .009989
  10      2.691          .8670          23    -.4611E-6      .009989
  11     -1.223         -.3560          24     .9607E-7      .009989
  12       .5097         .1537          25    -.1921E-7      .009989


In this example, the terms $x_j$ become relatively large, but they are then added
to form the much smaller number $e^{-5}$. This means there are loss-of-significance
errors in the calculation of the sum. To avoid this problem in this case is quite
easy. Either use

$$e^{-5} = \frac{1}{e^5}$$

and form $e^5$ with a series not involving cancellation of positive and negative
terms; or simply form $e^{-1} = 1/e$, perhaps using a series, and multiply it times
itself to form $e^{-5}$. With other series, it is likely that there will not be such a
simple solution.
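
The calculation behind Table 1.2, and the first of the two remedies, can be sketched with Python's `decimal` and `fractions` modules. This is an illustration under stated assumptions: a four-digit `decimal` context with round-half-up is taken as the model of the machine, and the rounded terms are then summed exactly, as in the table.

```python
import math
from decimal import Context, Decimal, ROUND_HALF_UP
from fractions import Fraction

FOUR = Context(prec=4, rounding=ROUND_HALF_UP)   # four-digit rounded decimal arithmetic

def rounded_terms(x: int, n: int):
    """Terms x^j / j!, j = 0..n, each rounded to four significant digits."""
    return [FOUR.divide(Decimal(x ** j), Decimal(math.factorial(j)))
            for j in range(n + 1)]

# Exact sum of the rounded terms of the alternating series (1.4.17), n = 25.
alternating = float(sum(Fraction(t) for t in rounded_terms(-5, 25)))

# Remedy: sum the all-positive series for e^5, then take the reciprocal.
reciprocal = float(1 / sum(Fraction(t) for t in rounded_terms(5, 25)))

print(alternating, reciprocal, math.exp(-5))
```

The alternating sum comes out near .009989, as in the last row of the table, while the reciprocal form is close to the true value .006738.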


Propagated error in function evaluations Let f(x) be a given function, and let
$\hat{f}(x)$ denote the result of evaluating f(x) on a computer. Then $f(x_T)$ denotes the
desired function value and $\hat{f}(x_A)$ is the value actually computed. For the error,
we write

$$f(x_T) - \hat{f}(x_A) = [f(x_T) - f(x_A)] + [f(x_A) - \hat{f}(x_A)] \tag{1.4.18}$$

The first quantity in brackets is called the propagated error, and the second is the
error due to evaluating $f(x_A)$ on a computer. This second error is generally a
small random number, based on an assortment of rounding errors that occurred
in carrying out the arithmetic operations defining f(x). We referred to it earlier
in Section 1.3 as the noise in evaluating f(x).

For the propagated error, the mean value theorem gives

$$f(x_T) - f(x_A) \doteq f'(x_T)(x_T - x_A) \tag{1.4.19}$$

This assumes that $x_A$ and $x_T$ are relatively close and that f'(x) does not vary
greatly for x between $x_A$ and $x_T$.


Example  $\sin(\pi/5) - \sin(.628) \doteq \cos(\pi/5)[(\pi/5) - .628] = .00026$, which is
an excellent estimation of the error.


Using Taylor’s theorem (1.1.12) for functions of two variables, we can
generalize the preceding to propagation of error in functions of two variables:

$$f(x_T, y_T) - f(x_A, y_A) \doteq f_x(x_T, y_T)(x_T - x_A) + f_y(x_T, y_T)(y_T - y_A) \tag{1.4.20}$$

with $f_x \equiv \partial f/\partial x$. We are assuming that $f_x(x, y)$ and $f_y(x, y)$ do not vary greatly
for (x, y) between $(x_T, y_T)$ and $(x_A, y_A)$.


Example For $f(x, y) = x^y$, we have $f_x = y x^{y-1}$, $f_y = x^y \log(x)$. Then (1.4.20)
yields

$$x_T^{y_T} - x_A^{y_A} \doteq \epsilon\, y_T x_T^{y_T - 1} + \eta\, [\log(x_T)]\, x_T^{y_T}$$

$$\text{Rel}(x_A^{y_A}) \doteq y_T [\text{Rel}(x_A) + \text{Rel}(y_A) \log(x_T)] \tag{1.4.21}$$

The relative error in $x_A^{y_A}$ may be large, even though $\text{Rel}(x_A)$ and $\text{Rel}(y_A)$ are
small. As a further illustration, take $y_T = y_A = 500$, $x_T = 1.2$, $x_A = 1.2001$. Then
$x_T^{y_T} = 3.89604 \times 10^{39}$, $x_A^{y_A} = 4.06179 \times 10^{39}$, $\text{Rel}(x_A^{y_A}) = -.0425$. Compare this
with $\text{Rel}(x_A) \doteq -8.3 \times 10^{-5}$.
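
Both propagated-error estimates can be compared against the actual errors. The Python sketch below is illustrative; the variable names are chosen here, and double precision is assumed to be accurate enough to serve as the "exact" arithmetic.

```python
import math

def rel(true_value: float, approx_value: float) -> float:
    """Relative error (x_T - x_A) / x_T."""
    return (true_value - approx_value) / true_value

# Estimate (1.4.19) for f(x) = sin(x), x_T = pi/5, x_A = .628.
x_T, x_A = math.pi / 5, 0.628
sin_actual = math.sin(x_T) - math.sin(x_A)
sin_estimate = math.cos(x_T) * (x_T - x_A)

# Estimate (1.4.21) for f(x, y) = x^y with y_T = y_A = 500, x_T = 1.2, x_A = 1.2001.
xT, xA, yT, yA = 1.2, 1.2001, 500.0, 500.0
pow_actual = rel(xT ** yT, xA ** yA)
pow_estimate = yT * (rel(xT, xA) + rel(yT, yA) * math.log(xT))

print(sin_actual, sin_estimate)
print(pow_actual, pow_estimate)
```

The linear estimate for $x^y$ gives about -.0417 against an actual relative error of about -.0425: the factor $y_T = 500$ magnifies the small relative error in $x_A$.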


Error in data If the input data to an algorithm contain only r digits of 
accuracy, then it is sometimes suggested that only r-digit arithmetic should be 
used in any calculations involving these data. This is nonsense. It is certainly true 
that the limited accuracy of the data will affect the eventual results of the 
algorithmic calculations, giving answers that are in error. Nonetheless, there is no 
reason to make matters worse by using r-digit arithmetic with correspondingly 
sized rounding errors. Instead one should use a higher precision arithmetic, to 
avoid any further degradation in the accuracy of results of the algorithm. This 
will lead to arithmetic rounding errors that are less significant than the error in 
the data, helping to preserve the accuracy associated with the data. 


1.5 Errors in Summation 


Many numerical methods, especially in linear algebra, involve summations. In 
this section, we look at various aspects of summation, particularly as carried out 
in floating-point arithmetic. 

Consider the computation of the sum 


$$S = \sum_{j=1}^{m} x_j \tag{1.5.1}$$

with $x_1, \ldots, x_m$ floating-point numbers. Define

$$S_2 = \text{fl}(x_1 + x_2) = (x_1 + x_2)(1 + \epsilon_2) \tag{1.5.2}$$

where we have made use of (1.4.2) and (1.2.10). Define recursively

$$S_{r+1} = \text{fl}(S_r + x_{r+1}) \qquad r = 2, \ldots, m - 1$$

Then

$$S_{r+1} = (S_r + x_{r+1})(1 + \epsilon_{r+1}) \tag{1.5.3}$$

The quantities $\epsilon_2, \ldots, \epsilon_m$ satisfy (1.2.8) or (1.2.9), depending on whether
chopping or rounding is used.
Expanding the first few sums, we obtain the following:

$$S_2 - (x_1 + x_2) = \epsilon_2 (x_1 + x_2)$$

$$S_3 - (x_1 + x_2 + x_3) = (x_1 + x_2)\epsilon_2 + (x_1 + x_2)(1 + \epsilon_2)\epsilon_3 + x_3 \epsilon_3
\doteq (x_1 + x_2)\epsilon_2 + (x_1 + x_2 + x_3)\epsilon_3$$

$$S_4 - (x_1 + x_2 + x_3 + x_4) \doteq (x_1 + x_2)\epsilon_2 + (x_1 + x_2 + x_3)\epsilon_3
+ (x_1 + x_2 + x_3 + x_4)\epsilon_4$$


Table 1.3  Calculating S on a machine using chopping

    n     True     SL       Error    LS       Error
   10     2.929    2.928    .001     2.927    .002
   25     3.816    3.813    .003     3.806    .010
   50     4.499    4.491    .008     4.479    .020
  100     5.187    5.170    .017     5.142    .045
  200     5.878    5.841    .037     5.786    .092
  500     6.793    6.692    .101     6.569    .224
 1000     7.486    7.284    .202     7.069    .417


Table 1.4  Calculating S on a machine using rounding

    n     True     SL       Error    LS       Error
   10     2.929    2.929    0.0      2.929    0.0
   25     3.816    3.816    0.0      3.817    -.001
   50     4.499    4.500    -.001    4.498    .001
  100     5.187    5.187    0.0      5.187    0.0
  200     5.878    5.878    0.0      5.876    .002
  500     6.793    6.794    -.001    6.783    .010
 1000     7.486    7.486    0.0      7.449    .037


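We have neglected cross-product terms $\epsilon_i \epsilon_j$, since they will be of much smaller
magnitude. By induction, we obtain

$$S_m - \sum_{i=1}^{m} x_i \doteq (x_1 + x_2)\epsilon_2 + \cdots + (x_1 + x_2 + \cdots + x_m)\epsilon_m$$

$$= x_1(\epsilon_2 + \epsilon_3 + \cdots + \epsilon_m) + x_2(\epsilon_2 + \epsilon_3 + \cdots + \epsilon_m)
+ x_3(\epsilon_3 + \cdots + \epsilon_m) + \cdots + x_m \epsilon_m \tag{1.5.4}$$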


From this formula we deduce that the best strategy for addition is to add from
the smallest to the largest. Of course, counterexamples can be produced, but over
a large number of summations, the preceding rule should be best. This is
especially true if the numbers $x_j$ are all of one sign so that no cancellation occurs
in the calculation of the intermediate sums $x_1 + \cdots + x_k$, $k = 1, \ldots, m$. In this
case, if chopping is used, rather than rounding, and if all $x_j > 0$, then there is no
cancellation in the sums of the $\epsilon_j$. With the strategy of adding from smallest to
largest, we minimize the effect of these chopping errors.


Example Define the terms $x_j$ of the sum S as follows: convert the fraction 1/j
to a decimal fraction, round it to four significant digits, and let this be $x_j$. To
make the errors in the calculation of S more clear, we use four-digit decimal
floating-point arithmetic. Tables 1.3 and 1.4 contain the results of four different
ways of computing S. Adding S from largest to smallest is denoted by LS, and
adding from smallest to largest is denoted by SL. Table 1.3 uses chopped
arithmetic, with

$$-.001 \le \epsilon_j \le 0 \tag{1.5.5}$$

and Table 1.4 uses rounded arithmetic, with

$$-.0005 \le \epsilon_j \le .0005 \tag{1.5.6}$$

The numbers $\epsilon_j$ refer to (1.5.4), and their bounds come from (1.2.8) and (1.2.9).

In both tables, it is clear that the strategy of adding S from the smallest term
to the largest is superior to the summation from the largest term to the smallest.
Of much more significance, however, is the far smaller error with rounding as
compared to chopping. The difference is much more than the factor of 2 that
would come from the relative size of the bounds in (1.5.5) and (1.5.6). We next
give an analysis of this.
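
An experiment in the spirit of Tables 1.3 and 1.4 can be run with Python's `decimal` module. The sketch below is an approximation of the setup, not a reproduction of the original program: four-digit contexts with `ROUND_DOWN` and `ROUND_HALF_UP` are assumed to model chopping and rounding, and the "true" sum is the exact sum of the rounded terms, so individual digits may differ from the tables.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP
from fractions import Fraction

ROUND4 = Context(prec=4, rounding=ROUND_HALF_UP)   # four-digit rounding
CHOP4 = Context(prec=4, rounding=ROUND_DOWN)       # four-digit chopping

def summation_error(n: int, ctx: Context, largest_first: bool) -> float:
    """True sum minus computed sum of x_j = fl(1/j), j = 1..n, added in ctx arithmetic."""
    terms = [ROUND4.divide(Decimal(1), Decimal(j)) for j in range(1, n + 1)]
    if not largest_first:
        terms.reverse()                  # start with 1/n: smallest to largest
    total = Decimal(0)
    for t in terms:
        total = ctx.add(total, t)        # each partial sum is cut to four digits
    exact = sum(Fraction(t) for t in terms)
    return float(exact - Fraction(total))

for n in (10, 100, 1000):
    print(n,
          round(summation_error(n, CHOP4, False), 3),   # chopping, SL
          round(summation_error(n, CHOP4, True), 3),    # chopping, LS
          round(summation_error(n, ROUND4, False), 3),  # rounding, SL
          round(summation_error(n, ROUND4, True), 3))   # rounding, LS
```

The output should show the same pattern as the tables: smallest-to-largest beats largest-to-smallest, and rounding beats chopping by far more than a factor of 2.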


A statistical analysis of error propagation Consider a general error sum

$$E = \sum_{j=1}^{n} \epsilon_j \tag{1.5.7}$$

of the type that occurs in the summation error (1.5.4). A simple bound is

$$|E| \le n\delta \tag{1.5.8}$$

where $\delta$ is a bound on $\epsilon_1, \ldots, \epsilon_n$. Then $\delta = .001$ or $.0005$ in the preceding
example, depending on whether chopping or rounding is used. This bound (1.5.8)
is for the worst possible case in which all the errors $\epsilon_j$ are as large as possible and
of the same sign.

When using rounding, the symmetry in sign behavior of the $\epsilon_j$, as shown in
(1.2.9), makes a major difference in the size of E. In this case, a better model is to
assume that the errors $\epsilon_j$ are uniformly distributed random variables in the
interval $[-\delta, \delta]$ and that they are independent. Then

$$E = \sum_{j=1}^{n} \epsilon_j = n\bar{\epsilon}$$

The sample mean $\bar{\epsilon}$ is a new random variable, having a probability distribution
with mean 0 and variance $\delta^2/3n$. To calculate probabilities for statements
involving $\bar{\epsilon}$, it is important to note that the probability distribution for $\bar{\epsilon}$ is
well-approximated by the normal distribution with the same mean and variance,
even for small values such as $n \ge 10$. This follows from the Central Limit
Theorem of probability theory [e.g., see Hogg and Craig (1978, chap. 5)]. Using
the approximating normal distribution, the probability is $\tfrac{1}{2}$ that

$$|\bar{\epsilon}| \le .39\delta/\sqrt{n} \qquad |E| \le .39\delta\sqrt{n}$$

and the probability is .99 that

$$|\bar{\epsilon}| \le 1.49\delta/\sqrt{n} \qquad |E| \le 1.49\delta\sqrt{n} \tag{1.5.9}$$

The result (1.5.9) is a considerable improvement upon (1.5.8) if n is at all large.
This analysis can also be applied to the case of chopping error. But in that
case, $-\delta \le \epsilon_j \le 0$. The sample mean $\bar{\epsilon}$ now has a mean of $-\delta/2$, while
retaining the same variance of $\delta^2/3n$. Thus there is a probability of .99 that

$$\left(\frac{n}{2} - 1.49\sqrt{n}\right)\delta \le -E \le \left(\frac{n}{2} + 1.49\sqrt{n}\right)\delta \tag{1.5.10}$$

For large n, this ensures that $-E$ will approximately equal $n\delta/2$, which is much
larger than (1.5.9) for the case of rounding errors.

When these results, (1.5.9) and (1.5.10), are applied to the general summation 
error (1.5.4), we see the likely reason for the significantly different error behavior 
of chopping and rounding in Tables 1.3 and 1.4. In general, rounded arithmetic is 
almost always to be preferred to chopped arithmetic. 
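
The statistical model can be checked by simulation. The following Python sketch is an illustration under the stated model only (independent, uniformly distributed errors); the function name, the sample sizes, and the fixed seed are choices made here.

```python
import math
import random

def simulate(n: int, delta: float, trials: int, seed: int = 0):
    """Monte Carlo check of the bound (1.5.9) and of the mean of E under chopping."""
    rng = random.Random(seed)
    bound = 1.49 * delta * math.sqrt(n)
    inside = 0
    chop_total = 0.0
    for _ in range(trials):
        e_round = sum(rng.uniform(-delta, delta) for _ in range(n))  # rounding model
        e_chop = sum(rng.uniform(-delta, 0.0) for _ in range(n))     # chopping model
        inside += abs(e_round) <= bound
        chop_total += e_chop
    return inside / trials, chop_total / trials

frac_inside, mean_chop = simulate(n=400, delta=1.0, trials=1000)
print(frac_inside)   # fraction of trials with |E| <= 1.49 * delta * sqrt(n)
print(mean_chop)     # average E under chopping, to compare with -n * delta / 2
```

About 99 percent of the rounding trials should satisfy (1.5.9), while the chopping errors accumulate to roughly $-n\delta/2$.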

Although statistical analyses give more realistic bounds, they are usually much 
more difficult to compute. As a more sophisticated example, see Henrici (1962, 
pp. 41-59) for a statistical analysis of the error in the numerical solution of 
differential equations. An example is given in Table 6.3 of Chapter 6 of the 
present textbook. 


Inner products Given two vectors $x, y \in R^m$, we call

$$x^T y = \sum_{j=1}^{m} x_j y_j \tag{1.5.11}$$

the inner product of x and y. (The notation $x^T$ denotes the matrix transpose
of x.) Properties of the inner product are examined in Chapter 7, but we note
here that

$$\|x\|_2 = \sqrt{x^T x} \tag{1.5.12}$$
$$|x^T y| \le \|x\|_2 \|y\|_2 \tag{1.5.13}$$


The latter inequality is called the Cauchy—Schwarz inequality, and it is proved in 
a more general setting in Chapter 4. Sums of the form (1.5.11) occur commonly 
in linear algebra problems (for example, matrix multiplication). We now consider 
the numerical computation of such sums. 

Assume $x_i$ and $y_i$, $i = 1, \ldots, m$, are floating-point numbers. Define

$$S_1 = \text{fl}(x_1 y_1)$$
$$S_{k+1} = \text{fl}(S_k + \text{fl}(x_{k+1} y_{k+1})) \qquad k = 1, 2, \ldots, m - 1 \tag{1.5.14}$$

Then as before, using (1.2.10),

$$S_1 = x_1 y_1 (1 + \epsilon_1)$$
$$S_2 = [S_1 + x_2 y_2 (1 + \epsilon_2)](1 + \eta_2)$$
$$\vdots$$
$$S_m = [S_{m-1} + x_m y_m (1 + \epsilon_m)](1 + \eta_m)$$

with the terms $\epsilon_j$, $\eta_j$ satisfying (1.2.8) or (1.2.9), depending on whether chopping
or rounding, respectively, is used. Combining and rearranging the preceding
formulas, we obtain

$$S_m = \sum_{j=1}^{m} x_j y_j (1 + \gamma_j) \tag{1.5.15}$$

with

$$1 + \gamma_j = (1 + \epsilon_j)(1 + \eta_j)(1 + \eta_{j+1}) \cdots (1 + \eta_m) \qquad \eta_1 = 0$$
$$\doteq 1 + \epsilon_j + \eta_j + \eta_{j+1} + \cdots + \eta_m \tag{1.5.16}$$


The last approximation is based on ignoring the products of the small terms
$\eta_i \eta_k$, $\epsilon_i \eta_k$. This brings us back to the same kind of analysis as was done earlier for
the sum (1.5.1). The statistical error analysis following (1.5.7) is also valid. For a
rigorous bound, it can be shown that if $m\delta < .01$, then

$$|\gamma_j| \le 1.01(m + 1 - j)\delta \qquad j = 1, \ldots, m \tag{1.5.17}$$

where $\delta$ is the unit round given in (1.2.12) [see Forsythe and Moler (1967, p. 92)].
Applying this to (1.5.15) and using (1.5.13),

$$|S - S_m| \le \sum_{j=1}^{m} |x_j y_j \gamma_j| \le 1.01\, m\, \delta\, \|x\|_2 \|y\|_2 \tag{1.5.18}$$

This says nothing about the relative error, since $x^T y$ can be zero even though all
$x_i$ and $y_i$ are nonzero.

These results say that the absolute error in $S_m \doteq x^T y$ does not increase very
rapidly, especially if true rounding is used and we consider the earlier statistical
analysis of (1.5.7). Nonetheless, it is often possible to easily and inexpensively
reduce this error a great deal further, and this is usually very important in linear
algebra problems.

Calculate each product $x_j y_j$ in a higher precision arithmetic, and carry out the
summation in this higher precision arithmetic. When the complete sum has been
computed, then round or chop the result back to the original arithmetic precision.
For example, when $x_i$ and $y_i$ are in single precision, then compute the products
and sums in double precision. [On most computers, single and double precision
are fairly close in running time; although some computers do not implement
double precision in their hardware, but only in their software, which is slower.]
The resulting sum $S_m$ will satisfy

$$S - S_m \doteq \delta S \tag{1.5.19}$$

a considerable improvement on (1.5.18) or (1.5.15). This can be used in parts of a
single precision calculation, significantly improving the accuracy without having
to do the entire calculation in double precision. For linear algebra problems, this
may halve the storage requirements as compared to that needed for an entirely
double precision computation.
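
The benefit of accumulating an inner product in higher precision can be seen in a small experiment. The Python sketch below is illustrative: single precision is emulated with the standard `struct` module, Python's native floats play the role of double precision, and the test vectors (ten thousand copies of the single precision value nearest .1) are a choice made here.

```python
import struct
from fractions import Fraction

def f32(v: float) -> float:
    """Round a Python float (a double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def dot_single(xs, ys) -> float:
    """The recursion (1.5.14) with every product and sum rounded to single precision."""
    s = 0.0
    for x, y in zip(xs, ys):
        s = f32(s + f32(x * y))
    return s

def dot_accumulated(xs, ys) -> float:
    """Products and sums carried in double precision; one final rounding to single."""
    s = 0.0
    for x, y in zip(xs, ys):
        s += x * y
    return f32(s)

xs = ys = [f32(0.1)] * 10000
exact = sum(Fraction(x) * Fraction(y) for x, y in zip(xs, ys))   # exact rational value
err_single = float(exact - Fraction(dot_single(xs, ys)))
err_accumulated = float(exact - Fraction(dot_accumulated(xs, ys)))
print(err_single, err_accumulated)
```

For these data the single precision recursion loses several digits, while the accumulated version is in error only by the final rounding, in line with (1.5.19).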


1.6 Stability in Numerical Analysis 


A number of mathematical problems have solutions that are quite sensitive to 
small computational errors, for example rounding errors. To deal with this 
phenomenon, we introduce the concepts of stability and condition number. The 
condition number of a problem will be closely related to the maximum accuracy 
that can be attained in the solution when using finite-length numbers and
computer arithmetic. These concepts will then be extended to the numerical 
methods that are used to calculate the solution. Generally we will want to use 
numerical methods that have no greater sensitivity to small errors than was true 
of the original mathematical problem. 

To simplify the presentation, the discussion is limited to problems that have 
the form of an equation 


F(x, y) =0 (1.6.1) 


The variable x is the unknown being sought, and the variable y is data on which 
the solution depends. This equation may represent many different kinds of 
problems. For example, (1) F may be a real valued function of the real variable 
x, and y may be a vector of coefficients present in the definition of F; or (2) the 
equation may be an integral or differential equation, with x an unknown function 
and y a given function or given boundary values. 

We say that the problem (1.6.1) is stable if the solution $x$ depends in a
continuous way on the variable $y$. This means that if $\{y_n\}$ is a sequence of values
approaching $y$ in some sense, then the associated solution values $\{x_n\}$ must also
approach $x$ in some way. Equivalently, if we make ever smaller changes in $y$,
these must lead to correspondingly smaller changes in $x$. The sense in which the
changes are small will depend on the norm being used to measure the sizes of the
vectors $x$ and $y$; there are many possible choices, varying with the problem.
Stable problems are also called well-posed problems, and we will use the two 
terms interchangeably. If a problem is not stable, it is called unstable or ill-posed. 


Example (a) Consider the solution of

$$ax^2 + bx + c = 0, \qquad a \neq 0$$

Any solution $x$ is a complex number. For the data in this case, we use
$y = (a, b, c)$, the vector of coefficients. It should be clear from the quadratic
formula

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

that the two solutions for $x$ will vary in a continuous way with the data
$y = (a, b, c)$.


(b) Consider the integral equation problem

$$\int_0^1 \frac{.75\, x(t)\, dt}{1.25 - \cos(2\pi(s + t))} = y(s), \qquad 0 \le s \le 1 \qquad (1.6.2)$$

This is an unstable problem. There are perturbations $\delta_n(s) = y_n(s) - y(s)$ for
which

$$\max_{0 \le s \le 1} |\delta_n(s)| \to 0 \quad \text{as } n \to \infty \qquad (1.6.3)$$

and the corresponding solutions $x_n(s)$ satisfy

$$\max_{0 \le s \le 1} |x_n(s) - x(s)| = 1 \quad \text{all } n \ge 1 \qquad (1.6.4)$$

Specifically, define $y_n(s) = y(s) + \delta_n(s)$,

$$\delta_n(s) = \frac{1}{2^n} \cos(2n\pi s), \qquad 0 \le s \le 1, \quad n \ge 1$$

Then it can be shown that

$$x_n(s) - x(s) = \cos(2n\pi s)$$

thus proving (1.6.4).
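
The claim can be checked numerically. The sketch below is an illustration only, with hypothetical function names: it applies the integral operator of (1.6.2) to the solution change $\cos(2n\pi t)$, using the trapezoidal rule with equal weights (very accurate for smooth periodic integrands), and compares the resulting data change with $2^{-n}$.

```python
import math


def apply_operator(x, s, panels=400):
    """Approximate the integral in (1.6.2) at the point s for a 1-periodic
    function x, using the trapezoidal rule with equally spaced nodes."""
    h = 1.0 / panels
    total = 0.0
    for i in range(panels):
        t = i * h
        total += 0.75 * x(t) / (1.25 - math.cos(2.0 * math.pi * (s + t)))
    return h * total


def data_change(n, s):
    """Change in y(s) caused by adding cos(2*n*pi*t) to the solution x(t)."""
    return apply_operator(lambda t: math.cos(2.0 * n * math.pi * t), s)


for n in (1, 5, 10, 20):
    # Largest data change over a grid of s values; the solution change has size 1.
    biggest = max(abs(data_change(n, k / 50.0)) for k in range(51))
    print(n, biggest, 2.0 ** (-n))
```

A change of size 1 in the solution therefore corresponds to a change in the data that shrinks like $2^{-n}$, which is the behavior described by (1.6.3) and (1.6.4).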


If a problem (1.6.1) is unstable, then there are serious difficulties in attempting 
to solve it. It is usually not possible to solve such problems without first 
attempting to understand more about the properties of the solution, usually by 
returning to the context in which the mathematical problem was formulated. This 
is currently a very active area of research in applied mathematics and numerical 
analysis [see, for example, Tikhonov and Arsenin (1977) and Wahba (1980)].

For practical purposes there are many problems that are stable in the 
previously given sense, but that are still very troublesome as far as numerical 
computations are concerned. To deal with this difficulty, we introduce a measure 
of stability called a condition number. It shows that practical stability represents a 
continuum of problems, some better behaved than others. 

The condition number attempts to measure the worst possible effect on the
solution $x$ of (1.6.1) when the variable $y$ is perturbed by a small amount. Let $\delta y$
be a perturbation of $y$, and let $x + \delta x$ be the solution of the perturbed equation

$$F(x + \delta x,\, y + \delta y) = 0 \qquad (1.6.5)$$

Define

$$K(x) = \operatorname*{Supremum}_{\delta y} \frac{\|\delta x\| / \|x\|}{\|\delta y\| / \|y\|} \qquad (1.6.6)$$

We have used the notation $\| \cdot \|$ to denote a measure of size. Recall the definitions
(1.1.16)-(1.1.18) for vectors from $R^n$ and $C[a, b]$. The example (1.6.2) used the
norm (1.1.18) for measuring the perturbations in both $x$ and $y$. Commonly $x$ and
$y$ may be different kinds of variables, and then different norms are appropriate.
The supremum in (1.6.6) is taken over all small perturbations $\delta y$ for which the
perturbed problem (1.6.5) will still make sense. Problems that are unstable lead to
$K(x) = \infty$.

The number $K(x)$ is called the condition number for (1.6.1). It is a measure of
the sensitivity of the solution $x$ to small changes in the data $y$. If $K(x)$ is quite
large, then there exist small relative changes $\delta y$ in $y$ that lead to large relative
changes $\delta x$ in $x$. But if $K(x)$ is small, say $K(x) < 10$, then small relative changes
in y always lead to correspondingly small relative changes in x. Since numerical 
calculations almost always involve a variety of small computational errors, we do 
not want problems with a large condition number. Such problems are called 
ill-conditioned, and they are generally very hard to solve accurately. 


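Example Consider solving

$$x - a^y = 0, \qquad a > 0 \qquad (1.6.7)$$

Perturbing $y$ by $\delta y$, we have

$$\frac{\delta x}{x} = \frac{a^{y + \delta y} - a^y}{a^y} = a^{\delta y} - 1$$

For the condition number for (1.6.7),

$$K(x) = \operatorname*{Supremum}_{\delta y} \left| \frac{\delta x / x}{\delta y / y} \right| = \operatorname*{Supremum}_{\delta y} \left| \frac{y\,(a^{\delta y} - 1)}{\delta y} \right|$$

Restricting $\delta y$ to be small, we have

$$K(x) \doteq |y \cdot \ln(a)| \qquad (1.6.8)$$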


Regardless of how we compute $x$ in (1.6.7), if $K(x)$ is large, then small relative
changes in $y$ will lead to much larger relative changes in $x$. If $K(x) = 10^4$ and if
the value of $y$ being used has relative error $10^{-7}$ due to using finite-length
computer arithmetic and rounding, then it is likely that the resulting value of $x$
will have relative error of about $10^{-3}$. This is a large drop in accuracy, and there
is little way to avoid it except perhaps by doing all computations in longer
precision computer arithmetic, provided $y$ can then be obtained with greater
accuracy.
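
As a small numerical illustration of (1.6.8), the following Python sketch compares the observed amplification of a relative change in $y$ with the estimate $|y \cdot \ln(a)|$. The values of `a` and `y` and the function name are arbitrary choices made here for illustration.

```python
import math


def relative_amplification(a, y, dy):
    """Ratio |dx/x| / |dy/y| for x = a**y when y is perturbed to y + dy."""
    x = a ** y
    x_perturbed = a ** (y + dy)
    return abs((x_perturbed - x) / x) / abs(dy / y)


a, y = 10.0, 50.0            # hypothetical values chosen for illustration
K = abs(y * math.log(a))     # the estimate (1.6.8)
for rel_dy in (1e-4, 1e-6, 1e-8):
    print(rel_dy, relative_amplification(a, y, rel_dy * y), K)
```

For small relative perturbations the observed ratio should settle near the estimate (1.6.8), about 115 for these values.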


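Example Consider the $n \times n$ nonsingular matrix

$$Y = \begin{bmatrix}
1 & \frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{n} \\
\frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \cdots & \frac{1}{n+1} \\
\vdots & \vdots & \vdots & & \vdots \\
\frac{1}{n} & \frac{1}{n+1} & \frac{1}{n+2} & \cdots & \frac{1}{2n-1}
\end{bmatrix} \qquad (1.6.9)$$

which is called the Hilbert matrix. The problem of calculating the inverse of $Y$, or
equivalently of solving $YX = I$ with $I$ the identity matrix, is a well-posed
problem. The solution $X$ can be obtained in a finite number of steps using only
simple arithmetic operations. But the problem of calculating $X$ is increasingly
ill-conditioned as $n$ increases.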


The ill-conditioning of the numerical inversion of $Y$ will be shown in a
practical setting. Let $\hat{Y}$ denote the result of entering the matrix $Y$ into an IBM
370 computer and storing the matrix entries using single precision floating-point
format. The fractional elements of $Y$ will be expanded in the hexadecimal (base
16) number system and then chopped after six hexadecimal digits (about seven
decimal digits). Since most of the entries in $Y$ do not have finite hexadecimal
expansions, there will be a relative error of about $10^{-6}$ in each such element
of $\hat{Y}$.

Using higher precision arithmetic, we can calculate the exact value of $\hat{Y}^{-1}$.
The inverse $Y^{-1}$ is known analytically, and we can compare it with $\hat{Y}^{-1}$. For
$n = 6$, some of the elements of $\hat{Y}^{-1}$ differ from the corresponding elements in
$Y^{-1}$ in the first nonzero digit. For example, the entries in row 6, column 2 are

$$(Y^{-1})_{6,2} = 83160.00 \qquad (\hat{Y}^{-1})_{6,2} = 73866.34$$


This makes the calculation of $Y^{-1}$ an ill-conditioned problem, and it becomes
increasingly so as $n$ increases. The condition number in (1.6.6) will be at least $10^5$
as a reflection of the poor accuracy in $\hat{Y}^{-1}$ compared with $Y^{-1}$. Lest this be
thought of as an odd pathological example that could not occur in practice, this 
particular example occurs when doing least squares approximation theory (e.g., 
see Section 4.3). The general area of ill-conditioned problems for linear systems 
and matrix inverses is considered in greater detail in Chapter 8. 
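
The experiment can be imitated in exact rational arithmetic. The Python sketch below is an illustration under stated assumptions: `chop_hex` is a hypothetical model of storing a number as a normalized base 16 fraction chopped to six hexadecimal digits, and the inverses are computed exactly with the standard `fractions` module, so any difference between the two results is due only to the chopped matrix entries.

```python
from fractions import Fraction


def hilbert(n):
    """Exact n x n Hilbert matrix with rational entries 1/(i+j-1)."""
    return [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]


def chop_hex(v, digits=6):
    """Chop a positive rational to `digits` hexadecimal digits of a normalized
    base 16 fraction (an assumed model of the single precision format)."""
    e = 0
    while v >= Fraction(16) ** e:
        e += 1
    while v < Fraction(16) ** (e - 1):
        e -= 1
    scale = Fraction(16) ** (digits - e)
    return Fraction(int(v * scale)) / scale


def inverse(a):
    """Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(a)
    m = [row[:] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if m[r][c] != 0)   # nonzero pivot row
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for r in range(n):
            if r != c:
                factor = m[r][c]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


n = 6
Y = hilbert(n)
Y_chopped = [[chop_hex(v) for v in row] for row in Y]
exact_entry = inverse(Y)[5][1]              # row 6, column 2 of the true inverse
perturbed_entry = inverse(Y_chopped)[5][1]  # same entry for the chopped matrix
print(float(exact_entry), float(perturbed_entry))
```

The two printed values can be compared with the pair of entries quoted above; the relative change in the inverse is many orders of magnitude larger than the relative change made to the matrix entries.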


Stability of numerical algorithms A numerical method for solving a mathematical
problem is considered stable if the sensitivity of the numerical answer to the
data is no greater than in the original mathematical problem. We will make this
more precise, again using (1.6.1) as a model for the problem. A numerical method
for solving (1.6.1) will generally result in a sequence of approximate problems

$$F_n(x_n, y_n) = 0 \qquad (1.6.10)$$

depending on some parameter, say $n$. The data $y_n$ are to approach $y$ as $n \to \infty$;
the function values $F_n(z, w)$ are to approach $F(z, w)$ as $n \to \infty$, for all $(z, w)$
near $(x, y)$; and hopefully the resulting approximate solutions $x_n$ will approach
$x$ as $n \to \infty$. For example, (1.6.1) may represent a differential equation initial
value problem, and (1.6.10) may represent a sequence of finite-difference
approximations depending on $h = 1/n$, as in and following (1.3.5). Another case would
be where $n$ represents the number of digits being used in the calculations, and we
may be solving $F(x, y) = 0$ as exactly as possible within this finite precision
arithmetic.

For each of the problems (1.6.10) we can define a condition number $K_n(x_n)$,
just as in (1.6.6). Using these condition numbers, define

$$\hat{K}(x) = \operatorname*{Limit}_{n \to \infty}\ \operatorname*{Supremum}_{k \ge n} K_k(x_k) \qquad (1.6.11)$$

We say the numerical method is stable if $\hat{K}(x)$ is of about the same magnitude as
$K(x)$ from (1.6.6), for example, if

$$\hat{K}(x) \le 2K(x)$$


If this is true, then the sensitivity of (1.6.10) to changes in the data is about the 
same as that of the original problem (1.6.1). 

Some problems and numerical methods may not fit easily within the frame- 
work of (1.6.1), (1.6.6), (1.6.10), and (1.6.11), but there is a general idea of stable 
problems and condition numbers that can be introduced and given similar 
meaning. The main use of these concepts in this text is in (1) rootfinding for 
polynomial equations, (2) solving differential equations, and (3) problems in 
numerical linear algebra. Generally there is little problem with unstable
numerical methods in this text. The main difficulty will be the solution of
ill-conditioned problems.


Example Consider the evaluation of a Bessel function,

$$x = J_m(y) = \left(\frac{y}{2}\right)^m \sum_{k=0}^{\infty} \frac{(-y^2/4)^k}{k!\,(m+k)!}, \qquad m \ge 0 \qquad (1.6.12)$$

This series converges very rapidly, and the evaluation of $x$ is easily shown to be a
well-conditioned problem in its dependence on $y$.

Now consider the evaluation of $J_m(y)$ using the triple recursion relation

$$J_{m+1}(y) = \frac{2m}{y} J_m(y) - J_{m-1}(y), \qquad m \ge 1 \qquad (1.6.13)$$

assuming $J_0(y)$ and $J_1(y)$ are known. We now demonstrate numerically that this
is an unstable numerical method for evaluating $J_m(y)$, for even moderately large
$m$. We take $y = 1$, so that (1.6.13) becomes

$$J_{m+1}(1) = 2m J_m(1) - J_{m-1}(1), \qquad m \ge 1 \qquad (1.6.14)$$

We use values for $J_0(1)$ and $J_1(1)$ that are accurate to 10 significant digits. The
subsequent values $J_m(1)$ are calculated from (1.6.14) using exact arithmetic, and
the results are given in Table 1.5. The true values are given for comparison, and
they show the rapid divergence of the approximate values from the true values.
The only errors introduced were the rounding errors in $J_0(1)$ and $J_1(1)$, and they
cause an increasingly large perturbation in $J_m(1)$ as $m$ increases.


Table 1.5 Computed values of $J_m(1)$

m    Computed J_m(1)      True J_m(1)

0    .7651976866          .7651976866
1    .4400505857          .4400505857
2    .1149034848          .1149034849
3    .195633535E - 1      .1956335398E - 1
4    .24766362E - 2       .2476638964E - 2
5    .2497361E - 3        .2497577302E - 3
6    .207248E - 4         .2093833800E - 4
7    -.10385E - 5         .1502325817E - 5


The use of three-term recursion relations

$$f_{m+1}(x) = a_m(x) f_m(x) - b_m(x) f_{m-1}(x), \qquad m \ge 1$$

is a common tool in applied mathematics and numerical analysis. But as
previously shown, they can lead to unstable numerical methods. For a general
analysis of triple recursion relations, see Gautschi (1967). In the case of (1.6.13)
and (1.6.14), large loss-of-significance errors are occurring.
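
The computation behind Table 1.5 can be repeated with a short Python sketch. It starts from the 10-digit values of $J_0(1)$ and $J_1(1)$ given in the table and applies the recursion in exact rational arithmetic (via the standard `fractions` module), so that the starting values are the only source of error; the "true" values are simply copied from the table.

```python
from fractions import Fraction

# Starting values, correct to 10 significant digits (Table 1.5).
j_prev = Fraction(7651976866, 10**10)   # J_0(1)
j_curr = Fraction(4400505857, 10**10)   # J_1(1)

# True values of J_m(1), m = 0..7, copied from Table 1.5.
true_values = [0.7651976866, 0.4400505857, 0.1149034849, 0.1956335398e-1,
               0.2476638964e-2, 0.2497577302e-3, 0.2093833800e-4,
               0.1502325817e-5]

computed = [j_prev, j_curr]
for m in range(1, 7):
    # J_{m+1}(1) = 2m J_m(1) - J_{m-1}(1), evaluated exactly
    j_prev, j_curr = j_curr, 2 * m * j_curr - j_prev
    computed.append(j_curr)

for m, (c, t) in enumerate(zip(computed, true_values)):
    print(m, float(c), t, abs(float(c) - t) / t)
```

The last column is the relative error; it grows rapidly with $m$, and by $m = 7$ even the sign of the computed value is wrong.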


Discussion of the Literature 


A knowledge of computer arithmetic is important for programmers who are 
concerned with numerical accuracy, particularly when writing programs that are 
to be widely used. Also, when writing programs to be run on various computers, 
their different floating-point characteristics must be taken into account. Classic 
treatments of floating-point arithmetic are given in Knuth (1981, chap. 4) and 
Sterbenz (1974). 

The topic of error propagation, especially that due to rounding/chopping
error, has been difficult to treat in a precise but useful manner. There are some
important early papers, but the current approaches to the subject are due in large
part to the late J. H. Wilkinson. Much of his work was in numerical linear
algebra, but he made important contributions to many areas of numerical
analysis. For a general introduction to his techniques for analyzing the
propagation of errors, with applications to several important problems, see Wilkinson
(1963), (1965), (1984).

Another approach to the control of error is called interval analysis. With it, we
carry along an interval $[x_l, x_u]$ in our calculations, rather than a single number
$x_A$, and the numbers $x_l$ and $x_u$ are guaranteed to bound the true value $x_T$. The
difficulty with this approach is that the size of $x_u - x_l$ is generally much larger
than $|x_T - x_A|$, mainly because the possible cancellation of errors of opposite
sign is often not considered when computing $x_l$ and $x_u$. For an introduction to
this area, showing how to improve on these conservative bounds in particular
cases, see Moore (1966). More recently, this area and that of computer arithmetic
have been combined to give a general theoretical framework allowing the 
development of algorithms with rigorous error bounds. As examples of this area, 
see the texts of Alefeld and Herzberger (1983), and Kulisch and Miranker (1981), 
the symposium proceedings of Alefeld and Grigorieff (1980), and the survey in 
Moore (1979). 

The topic of ill-posed problems was just touched on in Section 1.6, but it has 
been of increasing interest in recent years. There are many problems of indirect 
physical measurement that lead to ill-posed problems, and in this form they are 
called inverse problems. The book by Lavrentiev (1967) gives a general introduc- 
tion, although it discusses mainly (1) analytic continuation of analytic functions 
of a complex variable, and (2) inverse problems for differential equations. One of 
the major numerical tools used in dealing with ill-posed problems is called 
regularization, and an extensive development of it is given in Tikhonov and 
Arsenin (1977). As important examples of the more current literature on numeri- 
cal methods for ill-posed problems, see Groetsch (1984) and Wahba (1980). 

Two new types of computers have appeared in the last 10 to 15 years, and they 
are now having an increasingly important impact on numerical analysis. These 
are microcomputers and supercomputers. Everyone is aware of microcomputers; 
scientists and engineers are using them for an increasing amount of their 
numerical calculations. Initially the arithmetic design of microcomputers was 
quite poor, with some having errors in their basic arithmetic operations. Re- 
cently, an excellent new standard has been produced for arithmetic on microcom- 
puters, and with it one can write high-quality and efficient numerical programs. 
This standard, the IEEE Standard for Binary Floating-Point Arithmetic, is
described in IEEE (1985). Implementations on the major families of microprocessors
are becoming available; for example, see Palmer and Morse (1984).

The name supercomputer refers to a variety of machine designs, all having in
common the ability to do very high-speed numerical computations, say greater
than 20 million floating-point operations per second. This area is developing and
changing very rapidly, and so we can only give a few references to hint at the
effect of these machines on the design of numerical algorithms. Hockney and
Jesshope (1981), and Quinn (1987) are general texts on the architecture of
supercomputers and the design of numerical algorithms on them; Parter (1984) is
a symposium proceedings giving some applications of supercomputers in a
variety of physical problems; and Ortega and Voigt (1985) discuss
supercomputers as they are being used to solve partial differential equations. These
machines will become increasingly important in all areas of computing, and their
architectures are likely to affect smaller mainframe computers of the type that are
now more widely used.

Symbolic mathematics is a rapidly growing area, and with it one can do 
analytic rather than numerical mathematics, for example, finding antiderivatives 
exactly when possible. This area has not significantly affected numerical analysis 
to date, but that appears to be changing. In many situations, symbolic
mathematics is used for part of a calculation, with numerical methods used for the
remainder of the calculation. One of the most sophisticated of the programming
languages for carrying out symbolic mathematics is MACSYMA, which is 
described in Rand (1984). For a survey and historical account of programming 
languages for this area, see Van Hulzen and Calmet (1983). 

We conclude by discussing the area of mathematical software. This area deals 
with the implementation of numerical algorithms as computer programs, with 
careful attention given to questions of accuracy, efficiency, flexibility, portability, 
and other characteristics that improve the usefulness of the programs. A major 
journal of the area is the ACM Transactions on Mathematical Software. For an 
extensive survey of the area, including the most important program libraries that 
have been developed in recent years, see Cowell (1984). In the appendix to this
book, we give a further discussion of some currently available numerical analysis
computer program packages.


Problems


1. (a) Assume $f(x)$ is continuous on $a \le x \le b$, and consider the average

$$S = \frac{1}{n} \sum_{j=1}^{n} f(x_j)$$

with all points $x_j$ in the interval $[a, b]$. Show that

$$S = f(\zeta)$$

for some $\zeta$ in $[a, b]$. Hint: Use the intermediate value theorem and
consider the range of values of $f(x)$ and thus of $S$.


(b) Generalize part (a) to the sum

$$S = \sum_{j=1}^{n} w_j f(x_j)$$

with all $x_j$ in $[a, b]$ and all $w_j \ge 0$.


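2. Derive the following inequalities:

(a) $|e^x - e^z| \le |x - z|$ for all $x, z \le 0$.

(b) $|x - z| \le |\tan(x) - \tan(z)|$, $\quad -\frac{\pi}{2} < x, z < \frac{\pi}{2}$

(c) $p\,y^{p-1}(x - y) \le x^p - y^p \le p\,x^{p-1}(x - y)$, $\quad 0 \le y \le x$, $p \ge 1$
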
3. (a) Bound the error in the approximation

$$\sin(x) \doteq x, \qquad |x| \le \delta$$

(b) For small values of $\delta$, measure the relative error in $\sin(x) \doteq x$ by
using

$$\frac{\sin(x) - x}{\sin(x)} \doteq \frac{\sin(x) - x}{x}$$

Bound this modified relative error for $|x| \le \delta$. Choose $\delta$ to make this
error less than .01, corresponding to a 1 percent error.


4. Assuming $g \in C[a, b]$, show


fier — x)*g(x) dx = © 90 some {in[a, b] 
0 F 


5. Construct a Taylor series for the following functions, and bound the error
when truncating after $n$ terms.


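(a) $\frac{1}{x} \int_0^x e^{-t^2}\, dt$    (b) $\sin^{-1}(x)$, $\quad |x| < 1$

(c) $\frac{1}{x} \int_0^x \frac{\tan^{-1} t}{t}\, dt$    (d) $\cos(x) + \sin(x)$

(e) $\log(1 - x)$, $\quad -1 < x < 1$    (f) $\log\left[\frac{1 + x}{1 - x}\right]$, $\quad -1 < x < 1$
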
6. (a) Using the result (1.1.11), we can show

$$\frac{\pi}{4} = \tan^{-1}(1) = \sum_{j=0}^{\infty} \frac{(-1)^j}{2j + 1}$$

and we can obtain $\pi$ by multiplying by 4. Why is this not a practical
way to compute $\pi$?


(b) Using a Taylor polynomial approximation, give a practical way to
evaluate $\pi$.


7. Using Taylor's theorem for functions of two variables, find linear and
quadratic approximations to the following functions $f(x, y)$ for small
values of $x$ and $y$. Give the tangent plane function $z = p(x, y)$ whose
graph is tangent to that of $z = f(x, y)$ at $(0, 0, f(0, 0))$.


1+x 


(a) yl+2x-y (b) 15 


(c) x -cos(x —y) (d) cos(x + 77 +y) 


8. Consider the second-order divided difference $f[x_0, x_1, x_2]$ defined in
(1.1.13).

(a) Prove the property (1.1.15), that the order of the arguments $x_0, x_1, x_2$
does not affect the value of the divided difference.

(b) Prove formula (1.1.14),

$$f[x_0, x_1, x_2] = \tfrac{1}{2} f''(\zeta)$$

for some $\zeta$ between the minimum and maximum of $x_0$, $x_1$, and $x_2$.
Hint: From part (a), there is no loss of generality in assuming
$x_0 < x_1 < x_2$. Use Taylor's theorem to reduce $f[x_0, x_1, x_2]$,
expanding about $x_1$; and then use the intermediate value theorem to simplify
the error term.

(c) Assuming $f(x)$ is twice continuously differentiable, show that
$f[x_0, x_1, x_2]$ can be extended continuously to the case where some or
all of the points $x_0$, $x_1$, and $x_2$ are coincident. For example, show

$$f[x_0, x_1, x_0] = \operatorname*{Limit}_{x_2 \to x_0} f[x_0, x_1, x_2]$$

exists and compute a formula for it.


9. (a) Show that the vector norms (1.1.16) and (1.1.18) satisfy the three
general properties of norms that are listed following (1.1.18).

(b) Show $\|x\|_2$ in (1.1.17) is a vector norm, restricting yourself to the
$n = 2$ case.

(c) Show that the matrix norm (1.1.19) satisfies (1.1.20) and (1.1.21). For
simplicity, consider only matrices of order $2 \times 2$.


10. Convert the following numbers to their decimal equivalents.

(a) $(10101.101)_2$    (b) $(243.\mathrm{FF})_{16}$    (c) $(.101010101\ldots)_2$

(d) $(.\mathrm{AAAA}\ldots)_{16}$    (e) $(.00011001100110011\ldots)_2$

(f) $(11\ldots1)_2$ with the parentheses enclosing $n$ 1s.


11. To convert a positive decimal integer $x$ to its binary equivalent,


$$x = (a_n a_{n-1} \cdots a_1 a_0)_2$$

begin by writing

$$x = a_n \cdot 2^n + a_{n-1} \cdot 2^{n-1} + \cdots + a_1 \cdot 2^1 + a_0 \cdot 2^0$$

Based on this, use the following algorithm.

(i) $x_0 := x$; $j := 0$
(ii) While $x_j \neq 0$, Do the following:

        $a_j :=$ Remainder of integer divide $x_j / 2$
        $x_{j+1} :=$ Quotient of integer divide $x_j / 2$
        $j := j + 1$

     End While

The language of the algorithm should be self-explanatory. Apply it to
convert the following integers to their binary equivalents.

(a) 49 (b) 127 (c) 129


12. To convert a positive decimal fraction $x < 1$ to its binary equivalent

$$x = (.a_1 a_2 a_3 \ldots)_2$$

begin by writing

$$x = a_1 \cdot 2^{-1} + a_2 \cdot 2^{-2} + a_3 \cdot 2^{-3} + \cdots$$

Based on this, use the following algorithm.

(i) $x_1 := x$; $j := 1$
(ii) While $x_j \neq 0$, Do the following:

        $a_j :=$ Integer part of $2 \cdot x_j$
        $x_{j+1} :=$ Fractional part of $2 \cdot x_j$
        $j := j + 1$

     End While

Apply this algorithm to convert the following decimal numbers to their
binary equivalents.

(a) .8125 (b) 12.0625 (c) .1 (d) .2 (e) .4

(f) $\frac{1}{7} = .142857142857\ldots$


13. Generalize Problems 11 and 12 to the conversion of a decimal integer to its
hexadecimal equivalent.
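
The two conversion algorithms of Problems 11 and 12 can be sketched in Python as follows. This is only one possible rendering: the function names are hypothetical, exact rational arithmetic is used for the fractional case, and the `max_digits` cutoff is an added safeguard that is not part of the stated algorithm (the loop of Problem 12 never terminates for fractions with an infinite binary expansion).

```python
from fractions import Fraction


def integer_to_binary(x):
    """Binary digits of a positive integer by repeated division by 2."""
    digits = []
    while x != 0:
        x, remainder = divmod(x, 2)    # quotient becomes the next x
        digits.append(str(remainder))  # remainders are a_0, a_1, ...
    return "".join(reversed(digits))   # written as a_n ... a_1 a_0


def fraction_to_binary(x, max_digits=24):
    """Leading binary digits of a fraction 0 < x < 1 by repeated doubling."""
    x = Fraction(x)
    digits = []
    while x != 0 and len(digits) < max_digits:
        x *= 2
        a = int(x)          # integer part is the next digit
        digits.append(str(a))
        x -= a              # fractional part carries on
    return "." + "".join(digits)


print(integer_to_binary(10))         # 1010
print(fraction_to_binary("0.625"))   # .101
```

Passing the fraction as a string (or as a `Fraction`) avoids the binary rounding that would already have occurred in a floating-point literal.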


14. Predict the output of the following section of Fortran code if it is run on a
binary computer that uses chopped arithmetic.

      I = 0
      X = 0.0
      H = .1
   10 I = I + 1
      X = X + H
      PRINT *, I, X
      IF (X .LT. 1.0) GO TO 10

Would the outcome be any different if the statement "X = X + H" was
replaced by "X = I*H"?


15. Derive the bounds (1.2.9) for the relative error in the rounded floating-point
representation of (1.2.5)-(1.2.6).


16. Derive the upper bound result $M = \beta^t$ given in (1.2.16).


17. (a) Write a program to create an overflow error on your computer. For
example, input a number $x$ and repeatedly square it.

(b) Write a program to experimentally determine the largest allowable
floating-point number.


18. (a) A simple model for population growth is

$$\frac{dN}{dt} = kN, \qquad t \ge t_0, \qquad N(t_0) = N_0$$

with $N(t)$ the population at time $t$ and $k > 0$. Show that this implies
a geometric rate of increase in population:

$$N(t + 1) = C\,N(t), \qquad t \ge t_0$$

Find a formula for $C$.

(b) A more sophisticated model for population growth is

$$\frac{dN}{dt} = kN[1 - bN], \qquad t \ge t_0, \qquad N(t_0) = N_0$$

with $b, k > 0$ and $1 - bN_0 > 0$. Find the solution to this differential
equation problem. Compare its solution to that of part (a). Describe
the differences in population growth predicted by the two models, for
both large and small values of $t$.


19. On your computer, evaluate the two functions

(a) $f(x) = x^3 - 3x^2 + 3x - 1$

(b) $f(x) = x^3 + 2x^2 - x - 2$

Evaluate them for a large sampling of values of $x$ around 1, and try to
produce the kind of behavior shown in Figure 1.2. Compare the results for
the two functions.


20. Write a program to compute experimentally

$$\operatorname*{Limit}_{p \to \infty} (x^p + y^p)^{1/p}$$

where $x$ and $y$ are positive numbers. First do the computation in the form
just shown. Second, repeat the computation with the idea used in (1.3.8).
Run the program for a variety of large and small values of $x$ and $y$, for
example, $x = y = 10^{10}$ and $x = y = 10^{-10}$.


21. For the following numbers $x_A$ and $x_T$, how many significant digits are
there in $x_A$ with respect to $x_T$?

(a) $x_A = 451.023$, $x_T = 451.01$

(b) $x_A = -.045113$, $x_T = -.04518$

(c) $x_A = 23.4213$, $x_T = 23.4604$


22. Let all of the following numbers be correctly rounded to the number of
digits shown: (a) 1.1062 + .947, (b) 23.46 - 12.753, (c) (2.747)(6.83), (d)
8.473/.064. For each calculation, determine the smallest interval in which
the result, using true instead of rounded values, must be located.


23. Prove the formula (1.4.5) for $\mathrm{Rel}(x_A / y_A)$.


24. Given the equation $x^2 - 40x + 1 = 0$, find its roots to five significant
digits. Use $\sqrt{399} \doteq 19.975$, correctly rounded to five digits.


25. Give exact ways of avoiding loss-of-significance errors in the following
computations.

(a) $\log(x + 1) - \log(x)$, $\quad$ large $x$

(b) $\sin(x) - \sin(y)$, $\quad x \doteq y$

(c) $\tan(x) - \tan(y)$, $\quad x \doteq y$



a See 2x5 
x 


® Waa. 229 


26. Use Taylor approximations to avoid the loss-of-significance error in the
following computations.

(a) $f(x) = \frac{e^x - e^{-x}}{2x}$

(b) $f(x) = \frac{\log(1 - x) + x e^{x/2}}{x^3}$

In both cases, what is $\operatorname*{Limit}_{x \to 0} f(x)$?


27. Consider evaluating $\cos(x)$ for large $x$ by using the Taylor approximation
(1.1.5),

$$\cos(x) \doteq 1 - \frac{x^2}{2!} + \cdots + (-1)^n \frac{x^{2n}}{(2n)!}$$

To see the difficulty involved in using this approximation, use it to evaluate
$\cos(2\pi) = 1$. Determine $n$ so that the Taylor approximation error is less
than .0005. Then repeat the type of computation used in (1.4.17) and Table
1.2. How should $\cos(x)$ be evaluated for larger values of $x$?


28. Suppose you wish to compute the values of (a) cos(1.473), (b) $\tan^{-1}(2.621)$,
(c) In (1.471), (d) e7**?. In each case, assume you have only a table of values 
of the function with the argument $x$ given in increments of .01. Choose the
table value whose argument is nearest to your given argument. Estimate the 
resulting error. 


29. Assume that $x_A = .937$ has three significant digits with respect to $x_T$.
Bound the relative error in $x_A$. For $f(x) = \sqrt{1 - x}$, bound the error and
relative error in $f(x_A)$ with respect to $f(x_T)$.


30. The numbers given below are correctly rounded to the number of digits
shown. Estimate the errors in the function values in terms of the errors in
the arguments. Bound the relative errors.

(a) $\sin[(3.14)(2.685)]$ (b) $\ln(1.712)$

(c) (1.56)3-444 


31. Write a computer subroutine to form the sum

$$S = \sum_{j=1}^{n} a_j$$

in three ways: (1) from smallest to largest, (2) from largest to smallest, and
(3) in double precision, with a single precision rounding/chopping error at
the conclusion of the summation. Use the double precision result to find the
error in the two single precision results. Print the results. Also write a main
program to create the following series in single precision, and use the
subroutine just given to sum the series. [Hint: In writing the subroutine, for
simplicity assume the terms of the series are arranged from largest to
smallest.]


(a) $\sum_{j=1}^{n} \frac{1}{j}$    (b) $\sum_{j=1}^{n} \frac{1}{j^2}$    (c) $\sum_{j=1}^{n} \frac{1}{j^3}$    (d) $\sum_{j=1}^{n} \frac{(-1)^j}{j}$

32. Consider the product $a_0 a_1 \cdots a_m$, where $a_0, a_1, \ldots, a_m$ are $m + 1$
numbers stored in a computer that uses $n$ digit base $\beta$ arithmetic. Define
$p_1 = fl(a_0 a_1)$, $p_2 = fl(a_2 p_1)$, $p_3 = fl(a_3 p_2)$, ..., $p_m = fl(a_m p_{m-1})$. If we
write $p_m = a_0 a_1 \cdots a_m (1 + w)$, determine an estimate for $w$. Assume that
$a_i = fl(a_i)$, $i = 0, 1, \ldots, m$. What is a rigorous bound for $w$? What is a
statistical estimate for the size of $w$?




ROOTFINDING 
FOR NONLINEAR 
EQUATIONS 


Finding one or more roots of an equation 
f(x) =0 (2.0.1) 


is one of the more commonly occurring problems of applied mathematics. In 
most cases explicit solutions are not available and we must be satisfied with being 
able to find a root to any specified degree of accuracy. The numerical methods 
for finding the roots are called iterative methods, and they are the main subject of
this chapter. 

We begin with iterative methods for solving (2.0.1) when $f(x)$ is any
continuously differentiable real valued function of a real variable $x$. The iterative
methods for this quite general class of equations will require knowledge of one or
more initial guesses $x_0$ for the desired root $\alpha$ of $f(x)$. An initial guess $x_0$ can
usually be found by using the context in which the problem first arose; otherwise,
a simple graph of $y = f(x)$ will often suffice for estimating $x_0$.

A second major problem discussed in this chapter is that of finding one or 
more roots of a polynomial equation 


$$p(x) = a_0 + a_1 x + \cdots + a_n x^n = 0, \qquad a_n \neq 0 \qquad (2.0.2)$$


The methods of the first problem are often specialized to deal with (2.0.2), and 
that will be our approach. But there is a large literature on methods that have 
been developed especially for polynomial equations, using their special properties 
in an essential way. These are the most important methods used in creating 
automatic computer programs for solving (2.0.2), and we will reference some such 
methods. 

The third class of problems to be discussed is the solution of nonlinear systems 
of equations. These systems are very diverse in form, and the associated numeri- 
cal analysis is both extensive and sophisticated. We will just touch on this 
subject, indicating some successful methods that are fairly general in applicabil- 
ity. An adequate development of the subject requires a good knowledge of both 
theoretical and numerical linear algebra, and these topics are not taken up until 
Chapters 7 through 9. 

The last class of problems discussed in this chapter are optimization problems.
In this case, we seek to maximize or minimize a real valued function $f(x_1, \ldots, x_n)$
and to find the point $(x_1, \ldots, x_n)$ at which the optimum is attained. Such
problems can often be reduced to a system of nonlinear equations, but it is
usually better to develop special methods to carry out the optimization directly.
The area of optimization is well-developed and extensive. We just briefly
introduce and survey the subject.


Figure 2.1 Iterative solution of $a - (1/x) = 0$.


To illustrate the concept of an iterative method for finding a root of (2.0.1), we 
begin with an example. Consider solving 


$$f(x) \equiv a - \frac{1}{x} = 0 \qquad (2.0.3)$$


for a given a > 0. This problem has a practical application to computers without 
a machine divide operation. This was true of some early computers, and some 
modern-day computers also use the algorithm derived below, as part of their 
divide operation. 

Let $x_0 \doteq 1/a$ be an approximate solution of the equation. At the point
$(x_0, f(x_0))$, draw the tangent line to the graph of $y = f(x)$ (see Figure 2.1). Let
$x_1$ be the point at which the tangent line intersects the $x$-axis. It should be an
improved approximation of the root $\alpha$.

To obtain an equation for $x_1$, match the slopes obtained from the tangent line
and the derivative of $f(x)$ at $x_0$:

$$f'(x_0) = \frac{f(x_0) - 0}{x_0 - x_1}$$

Substituting from (2.0.3) and manipulating, we obtain

$$x_1 = x_0(2 - a x_0)$$

The general iteration formula is then obtained by repeating the process, with $x_1$
replacing $x_0$, ad infinitum, to get

$$x_{n+1} = x_n(2 - a x_n), \qquad n \ge 0 \qquad (2.0.4)$$


A form more convenient for theoretical purposes is obtained by introducing
the scaled residual

$$r_n = 1 - a x_n \qquad (2.0.5)$$

Using it,

$$x_{n+1} = x_n(1 + r_n), \qquad n \ge 0 \qquad (2.0.6)$$

For the error,

$$e_n = \frac{1}{a} - x_n = \frac{r_n}{a} \qquad (2.0.7)$$

We will analyze the convergence of this method, its speed, and its dependence
on $x_0$. First,

$$r_{n+1} = 1 - a x_{n+1} = 1 - a x_n(1 + r_n) = 1 - (1 - r_n)(1 + r_n)$$

$$r_{n+1} = r_n^2 \qquad (2.0.8)$$

Inductively,

$$r_n = r_0^{2^n}, \qquad n \ge 0 \qquad (2.0.9)$$

From (2.0.7), the error $e_n$ converges to zero as $n \to \infty$ if and only if $r_n$ converges
to zero. From (2.0.9), $r_n$ converges to zero if and only if $|r_0| < 1$, or equivalently,

$$-1 < 1 - a x_0 < 1$$

$$0 < x_0 < \frac{2}{a} \qquad (2.0.10)$$

In order that $x_n$ converge to $1/a$, it is necessary and sufficient that $x_0$ be chosen
to satisfy (2.0.10).

To examine the speed of convergence when (2.0.10) is satisfied, we obtain
formulas for the error and relative error:

$$ e_{n+1} = \frac{r_{n+1}}{a} = \frac{r_n^2}{a} = \frac{e_n^2 a^2}{a} $$

$$ e_{n+1} = a e_n^2 \tag{2.0.11} $$

$$ \frac{e_{n+1}}{1/a} = e_n^2 a^2 = \left[ \frac{e_n}{1/a} \right]^2 $$

$$ \mathrm{Rel}(x_{n+1}) = [\mathrm{Rel}(x_n)]^2 \qquad n \ge 0 \tag{2.0.12} $$

The notation $\mathrm{Rel}(x_n)$ denotes the relative error in $x_n$. Based on
equation (2.0.11), we say $e_n$ converges to zero quadratically. To illustrate how
rapidly the error will decrease, suppose that $\mathrm{Rel}(x_0) = 0.1$. Then
$\mathrm{Rel}(x_4) = 10^{-16}$. Each iteration doubles the number of significant
digits.
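
The iteration (2.0.4) is short enough to try directly. The following is a
minimal sketch in Python (the function name `reciprocal_iterates` and the sample
values a = 3, x_0 = 0.3 are illustrative choices, not taken from the text). It
uses only multiplication and subtraction, and it prints the residual
r_n = 1 - a x_n, which equals the relative error Rel(x_n).

```python
def reciprocal_iterates(a, x0, steps):
    """Return [x_0, ..., x_steps] for x_{n+1} = x_n * (2 - a * x_n), see (2.0.4)."""
    if a <= 0:
        raise ValueError("a must be positive")
    if not 0.0 < a * x0 < 2.0:
        # Convergence condition (2.0.10): 0 < x0 < 2/a.
        raise ValueError("x0 must satisfy 0 < x0 < 2/a")
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x * (2.0 - a * x))  # no division is used
    return xs


a = 3.0
xs = reciprocal_iterates(a, x0=0.3, steps=5)
for n, x in enumerate(xs):
    residual = 1.0 - a * x  # r_n, which is also Rel(x_n)
    print(n, x, residual)
```

With Rel(x_0) = 0.1, the printed residuals should roughly square at each step
until they reach the level of the rounding error of the machine arithmetic.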

This example illustrates the construction of an iterative method for solving an 
equation; a complete convergence analysis has been given. This analysis included 
a proof of convergence, a determination of the interval of convergence for the 
choice of x9, and a determination of the speed of convergence. These ideas are 
examined in more detail in the following sections using more general approaches 
to solving (2.0.1). 


Definition A sequence of iterates $\{x_n \mid n \ge 0\}$ is said to converge with
order $p \ge 1$ to a point $\alpha$ if

$$ |\alpha - x_{n+1}| \le c\,|\alpha - x_n|^p \qquad n \ge 0 \tag{2.0.13} $$

for some $c > 0$. If $p = 1$, the sequence is said to converge linearly to
$\alpha$. In that case, we require $c < 1$; the constant c is called the rate of
linear convergence of $x_n$ to $\alpha$.

Using this definition, the earlier example (2.0.5)-(2.0.6) has order of
convergence 2, which is also called quadratic convergence. This definition of
order is not always a convenient one for some linearly convergent iterative
methods. Using induction on (2.0.13) with p = 1, we obtain

$$ |\alpha - x_n| \le c^n |\alpha - x_0| \qquad n \ge 0 \tag{2.0.14} $$

This shows directly the convergence of $x_n$ to $\alpha$. For some iterative
methods we can show (2.0.14) directly, whereas (2.0.13) may not be true for any
c < 1. In such a case, the method will still be said to converge linearly with a
rate of c.


2.1 The Bisection Method 

Assume that $f(x)$ is continuous on a given interval $[a, b]$ and that it also
satisfies

$$ f(a) f(b) < 0 \tag{2.1.1} $$

Using the intermediate value Theorem 1.1 from Chapter 1, the function $f(x)$
must have at least one root in $[a, b]$. Usually $[a, b]$ is chosen to contain
only one root $\alpha$, but the following algorithm for the bisection method will
always converge to some root $\alpha$ in $[a, b]$, because of (2.1.1).

Algorithm  Bisect (f, a, b, root, ε)

1. Define c := (a + b)/2.

2. If b - c ≤ ε, then accept root := c, and exit.

3. If sign(f(b)) · sign(f(c)) ≤ 0, then a := c; otherwise, b := c.

4. Return to step 1.


The interval $[a, b]$ is halved in size for every pass through the algorithm.
Because of step 3, $[a, b]$ will always contain a root of $f(x)$. Since a root
$\alpha$ is in $[a, b]$, it must lie within either $[a, c]$ or $[c, b]$; and
consequently

$$ |c - \alpha| \le b - c = c - a $$

This is justification for the test in step 2. On completion of the algorithm, c
will be an approximation to the root with

$$ |c - \alpha| \le \varepsilon $$
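
The four steps of Bisect translate almost line for line into a program. The
sketch below is one possible Python rendering; the `max_iter` safeguard, the
helper `sign`, and the choice to return the root rather than pass it back
through an argument are assumptions of this sketch, not part of the algorithm
as stated.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def bisect(f, a, b, eps, max_iter=200):
    """Bisection on [a, b]; returns c with |c - root| <= eps."""
    fb = f(b)
    if sign(f(a)) * sign(fb) > 0:
        raise ValueError("f(a) and f(b) must differ in sign, see (2.1.1)")
    for _ in range(max_iter):
        c = (a + b) / 2.0                 # step 1
        if b - c <= eps:                  # step 2
            return c
        fc = f(c)
        if sign(fb) * sign(fc) <= 0:      # step 3: a root lies in [c, b]
            a = c
        else:                             # otherwise a root lies in [a, c]
            b, fb = c, fc
    raise RuntimeError("tolerance not reached; eps may be below machine precision")


root = bisect(lambda x: x**6 - x - 1, 1.0, 2.0, eps=0.00005)
print(root)
```

Only one new evaluation of f is needed per pass, because f(b) is carried along
and updated whenever b changes.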


Example Find the largest real root $\alpha$ of

$$ f(x) \equiv x^6 - x - 1 = 0 \tag{2.1.2} $$

It is straightforward to show that $1 < \alpha < 2$, and we will use this as our
initial interval $[a, b]$. The algorithm Bisect was used with ε = .00005. The
results are shown in Table 2.1. The first two iterates give the initial interval
enclosing $\alpha$, and the remaining values $c_n$, $n \ge 1$, denote the
successive midpoints found using Bisect. The final value $c_{15} = 1.13474$ was
accepted as an approximation to $\alpha$ with $|\alpha - c_{15}| \le .00004$.

The true solution is

$$ \alpha = 1.13472413840152 \tag{2.1.3} $$

The true error in $c_{15}$ is

$$ \alpha - c_{15} = -.000016 $$

It is much smaller than the predicted error bound. It might seem as though we
could have saved some computation by stopping with an earlier iterate. But there
is no way to predict the possibly better accuracy in an earlier iterate, and thus
there is no way we can know the iterate is sufficiently accurate. For example,
$c_9$ is sufficiently accurate, but there was no way of telling that fact during
the computation.

Table 2.1 Example of bisection method

 n   c_n        f(c_n)       n   c_n        f(c_n)
     2.0 = b    61.0         8   1.13672     .02062
     1.0 = a    -1.0         9   1.13477     .00043
 1   1.5         8.89063    10   1.13379    -.00960
 2   1.25        1.56470    11   1.13428    -.00459
 3   1.125      -.09771     12   1.13452    -.00208
 4   1.1875      .61665     13   1.13464    -.00083
 5   1.15625     .23327     14   1.13470    -.00020
 6   1.14063     .06158     15   1.13474     .00016
 7   1.13281    -.01958


To examine the speed of convergence, let $c_n$ denote the nth value of c in the
algorithm. Then it is easy to see that

$$ \alpha = \lim_{n \to \infty} c_n $$

$$ |\alpha - c_n| \le \left( \frac{1}{2} \right)^n (b - a) \tag{2.1.4} $$

where b - a denotes the length of the original interval input into Bisect. Using
the variant (2.0.14) for defining linear convergence, we say that the bisection
method converges linearly with a rate of 1/2. The actual error may not decrease
by a factor of 1/2 at each step, but the average rate of decrease is 1/2, based
on (2.1.4). The preceding example illustrates the result (2.1.4).

There are several deficiencies in the algorithm Bisect. First, it does not take
account of the limits of machine precision, as described in Section 1.2 of Chapter
1. A practical program would take account of the unit round on the machine [see
(1.2.12)], adjusting the given ε if necessary. The second major problem with
Bisect is that it converges very slowly when compared with the methods defined
in the following sections. The major advantages of the bisection method are: (1)
it is guaranteed to converge (provided f is continuous on [a, b] and (2.1.1) is
satisfied), and (2) a reasonable error bound is available. Methods that at every
step give upper and lower bounds on the root $\alpha$ are called enclosure
methods. In Section 2.8, we describe an enclosure algorithm that combines the
previously stated advantages of the bisection method with the faster convergence
of the secant method (described in Section 2.3).


2.2 Newton’s Method 


Assume that an initial estimate $x_0$ is known for the desired root $\alpha$ of
$f(x) = 0$. Newton’s method will produce a sequence of iterates
$\{x_n : n \ge 1\}$, which we hope will converge to $\alpha$. Since $x_0$ is
assumed close to $\alpha$, approximate the graph of $y = f(x)$ in the vicinity of
its root $\alpha$ by constructing its tangent line at $(x_0, f(x_0))$. Then use
the root of this tangent line to approximate $\alpha$; call this new
approximation $x_1$. Repeat this process, ad infinitum, to obtain a sequence of
iterates $x_n$. As with the example (2.0.3) beginning this chapter, this leads to
the iteration formula

$$ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \qquad n \ge 0 \tag{2.2.1} $$

The process is illustrated in Figure 2.2, for the iterates $x_1$ and $x_2$.

Figure 2.2 Newton’s method.


Newton’s method is the best known procedure for finding the roots of an 
equation. It has been generalized in many ways for the solution of other, more 
difficult nonlinear problems, for example, systems of nonlinear equations and 
nonlinear integral and differential equations. It is not always the best method for 
a given problem, but its formal simplicity and its great speed often lead it to be 
the first method that people use in attempting to solve a nonlinear problem. 

As another approach to (2.2.1), we use a Taylor series development. Expanding
$f(x)$ about $x_n$,

$$ f(x) = f(x_n) + (x - x_n) f'(x_n) + \frac{(x - x_n)^2}{2} f''(\xi) $$

with $\xi$ between x and $x_n$. Letting $x = \alpha$ and using $f(\alpha) = 0$, we
solve for $\alpha$ to obtain

$$ \alpha = x_n - \frac{f(x_n)}{f'(x_n)} - \frac{(\alpha - x_n)^2}{2} \cdot \frac{f''(\xi_n)}{f'(x_n)} $$

with $\xi_n$ between $x_n$ and $\alpha$. We can drop the error term (the last
term) to obtain a better approximation to $\alpha$ than $x_n$, and we recognize
this approximation as $x_{n+1}$ from (2.2.1). Then

$$ \alpha - x_{n+1} = -(\alpha - x_n)^2 \cdot \frac{f''(\xi_n)}{2 f'(x_n)} \qquad n \ge 0 \tag{2.2.2} $$

This formula will be used to show that Newton’s method has a quadratic order of
convergence, p = 2 in (2.0.13).

Table 2.2 Example of Newton’s method

 n   x_n           f(x_n)      α - x_n       x_{n+1} - x_n
 0   2.0           61.0        -8.653E-1
 1   1.680628273   19.85       -5.459E-1     -2.499E-1
 2   1.430738989   6.147       -2.960E-1     -1.758E-1
 3   1.254970957   1.652       -1.202E-1     -9.343E-2
 4   1.161538433   2.943E-1    -2.681E-2     -2.519E-2
 5   1.136353274   1.683E-2    -1.629E-3     -1.623E-3
 6   1.134730528   6.574E-5    -6.390E-6     -6.390E-6
 7   1.134724139   1.015E-9    -9.870E-11    -9.870E-11


Example We again solve for the largest root of

$$ f(x) \equiv x^6 - x - 1 = 0 $$

Newton’s method (2.2.1) is used, and the results are shown in Table 2.2. The
computations were carried out in approximately 16-digit floating-point
arithmetic, and the table iterates were rounded from these more accurate
computations. The last column, $x_{n+1} - x_n$, is an estimate of
$\alpha - x_n$; this is discussed later in the section.

The Newton method converges very rapidly once an iterate is fairly close to
the root. This is illustrated in iterates $x_4, x_5, x_6, x_7$. The iterates
$x_0, x_1, x_2, x_3$ show the slow initial convergence that is possible with a
poor initial guess $x_0$. If the initial guess $x_0 = 1$ had been chosen, then
$x_4$ would have been accurate to seven significant digits and $x_5$ to fourteen
digits. These results should be compared with those of the bisection method
given in Table 2.1. The much greater speed of Newton’s method is apparent
immediately.


Convergence analysis A convergence result will be given, showing the speed of 
convergence and also an interval from which initial guesses can be chosen. 


Theorem 2.1 Assume $f(x)$, $f'(x)$, and $f''(x)$ are continuous for all x in some
neighborhood of $\alpha$, and assume $f(\alpha) = 0$, $f'(\alpha) \ne 0$. Then if
$x_0$ is chosen sufficiently close to $\alpha$, the iterates $x_n$, $n \ge 0$, of
(2.2.1) will converge to $\alpha$. Moreover,

$$ \lim_{n \to \infty} \frac{\alpha - x_{n+1}}{(\alpha - x_n)^2} = -\frac{f''(\alpha)}{2 f'(\alpha)} \tag{2.2.3} $$

proving that the iterates have an order of convergence p = 2.

Proof Pick a sufficiently small interval $I = [\alpha - \varepsilon, \alpha + \varepsilon]$
on which $f'(x) \ne 0$ [this exists by continuity of $f'(x)$], and then let

$$ M = \frac{\max_{x \in I} |f''(x)|}{2 \min_{x \in I} |f'(x)|} $$

From (2.2.2),

$$ |\alpha - x_1| \le M |\alpha - x_0|^2 $$

$$ M |\alpha - x_1| \le (M |\alpha - x_0|)^2 $$

Pick $|\alpha - x_0| \le \varepsilon$ and $M|\alpha - x_0| < 1$. Then
$M|\alpha - x_1| < 1$, and $M|\alpha - x_1| \le M|\alpha - x_0|$, which says
$|\alpha - x_1| \le \varepsilon$. We can apply the same argument to
$x_1, x_2, \ldots$, inductively, showing that $|\alpha - x_n| \le \varepsilon$ and
$M|\alpha - x_n| < 1$ for all $n \ge 1$.

To show convergence, use (2.2.2) to give

$$ |\alpha - x_{n+1}| \le M |\alpha - x_n|^2 $$

$$ M |\alpha - x_{n+1}| \le (M |\alpha - x_n|)^2 \tag{2.2.4} $$

and inductively,

$$ M |\alpha - x_n| \le (M |\alpha - x_0|)^{2^n} $$

$$ |\alpha - x_n| \le \frac{1}{M} (M |\alpha - x_0|)^{2^n} \tag{2.2.5} $$

Since $M|\alpha - x_0| < 1$, this shows that $x_n \to \alpha$ as $n \to \infty$.

In formula (2.2.2), the unknown point $\xi_n$ is between $x_n$ and $\alpha$,
implying $\xi_n \to \alpha$ as $n \to \infty$. Thus

$$ \lim_{n \to \infty} \frac{\alpha - x_{n+1}}{(\alpha - x_n)^2} = -\lim_{n \to \infty} \frac{f''(\xi_n)}{2 f'(x_n)} = -\frac{f''(\alpha)}{2 f'(\alpha)} \qquad \blacksquare $$


The error column in Table 2.2 can be used to illustrate (2.2.3). In particular,
for that example,

$$ -\frac{f''(\alpha)}{2 f'(\alpha)} \approx -2.417 \qquad \frac{\alpha - x_6}{(\alpha - x_5)^2} \approx -2.41 $$

Let M denote the limit on the right side of (2.2.3). Then if $x_n$ is near
$\alpha$, (2.2.3) implies

$$ M(\alpha - x_{n+1}) \approx [M(\alpha - x_n)]^2 $$

In order to have convergence of $x_n$ to $\alpha$, this statement says that we
should probably have

$$ |\alpha - x_0| < \frac{1}{|M|} \tag{2.2.6} $$

Thus M is a measure of how close $x_0$ must be chosen to $\alpha$ to ensure
convergence to $\alpha$. Some examples with large values of M are given in the
problems at the end of the chapter.

Figure 2.3 The Newton-Fourier method.

Another approach to the error analysis of Newton’s method is given by the
following construction and theorem. Assume $f(x)$ is twice continuously
differentiable on an interval $[a, b]$ containing $\alpha$. Further assume
$f(a) < 0$, $f(b) > 0$, and that

$$ f'(x) > 0 \qquad f''(x) > 0 \qquad \text{for } a \le x \le b \tag{2.2.7} $$

Then $f(x)$ is strictly increasing on $[a, b]$, and there is a unique root
$\alpha$ in $(a, b)$. Also, $f(x) < 0$ for $a \le x < \alpha$, and $f(x) > 0$ for
$\alpha < x \le b$.

Let $x_0 = b$ and define the Newton iterates $x_n$ as in (2.2.1). Next, define a
new sequence of iterates by

$$ z_{n+1} = z_n - \frac{f(z_n)}{f'(x_n)} \qquad n \ge 0 \tag{2.2.8} $$

with $z_0 = a$. The resulting iterates are illustrated in Figure 2.3. With the use
of $\{z_n\}$, we obtain excellent upper and lower bounds for $\alpha$. The use of
(2.2.8) with Newton’s method is called the Newton-Fourier method.


Theorem 2.2 As previously, assume $f(x)$ is twice continuously differentiable on
$[a, b]$, $f(a) < 0$, $f(b) > 0$, and condition (2.2.7). Then the iterates $x_n$
are strictly decreasing to $\alpha$, and the iterates $z_n$ are strictly
increasing to $\alpha$. Moreover,

$$ \lim_{n \to \infty} \frac{x_{n+1} - z_{n+1}}{(x_n - z_n)^2} = \frac{f''(\alpha)}{2 f'(\alpha)} \tag{2.2.9} $$

showing that the distance between $x_n$ and $z_n$ decreases quadratically with n.

Proof We first show that

$$ z_0 < z_1 < \alpha < x_1 < x_0 \tag{2.2.10} $$

From the definitions (2.2.1) and (2.2.8),

$$ x_1 - x_0 = \frac{-f(x_0)}{f'(x_0)} < 0 \qquad z_1 - z_0 = \frac{-f(z_0)}{f'(x_0)} > 0 $$

From the error formula (2.2.2),

$$ \alpha - x_1 = -(\alpha - x_0)^2 \frac{f''(\xi_0)}{2 f'(x_0)} < 0 $$

Finally,

$$ \alpha - z_1 = \alpha - z_0 + \frac{f(z_0)}{f'(x_0)} = \alpha - z_0 + \frac{f(z_0) - f(\alpha)}{f'(x_0)} $$

$$ = \alpha - z_0 - (\alpha - z_0) \frac{f'(\zeta_0)}{f'(x_0)} \qquad \text{some } z_0 < \zeta_0 < \alpha $$

$$ = (\alpha - z_0) \left[ \frac{f'(x_0) - f'(\zeta_0)}{f'(x_0)} \right] > 0 $$

because $f'(x)$ is an increasing function on $[a, b]$. Combining these results
proves (2.2.10). This proof can be repeated inductively to prove that

$$ z_n < z_{n+1} < \alpha < x_{n+1} < x_n \qquad n \ge 0 \tag{2.2.11} $$

The sequence $\{x_n\}$ is bounded below by $\alpha$, and thus it has an infimum
$\bar{x}$; similarly, the sequence $\{z_n\}$ has a supremum $\bar{z}$:

$$ \lim_{n \to \infty} x_n = \bar{x} \ge \alpha \qquad \lim_{n \to \infty} z_n = \bar{z} \le \alpha $$

Taking limits in (2.2.1) and (2.2.8), we obtain

$$ \bar{x} = \bar{x} - \frac{f(\bar{x})}{f'(\bar{x})} \qquad \bar{z} = \bar{z} - \frac{f(\bar{z})}{f'(\bar{x})} $$

which leads to $f(\bar{x}) = 0 = f(\bar{z})$. Since $\alpha$ is the unique root of
$f(x)$ in $[a, b]$, this proves $\{x_n\}$ and $\{z_n\}$ converge to $\alpha$.

The proof of (2.2.9) is more complicated, and we refer the reader to
Ostrowski (1973, p. 70). From Theorem 2.1 and formula (2.2.3), the sequence
$\{x_n\}$ converges to $\alpha$ quadratically. The result (2.2.9) shows that

$$ |\alpha - x_n| \le |z_n - x_n| $$

is an error bound that decreases quadratically. $\blacksquare$


The hypotheses of Theorem 2.2 can be reduced to $f(x)$ being twice continuously
differentiable in a neighborhood of $\alpha$ and

$$ f'(\alpha) f''(\alpha) \ne 0 \tag{2.2.12} $$

From this, there will be an interval $[a, b]$ about $\alpha$ with $f'(x)$ and
$f''(x)$ nonzero on the interval. Then the rootfinding problem $f(x) = 0$ will
satisfy (2.2.7) or it can be easily modified to an equivalent problem that will
satisfy it. For example, if $f'(\alpha) < 0$, $f''(\alpha) > 0$, then consider the
rootfinding problem $g(x) = 0$ with $g(x) \equiv f(-x)$. The root of g will be
$-\alpha$, and the conditions in (2.2.7) will be satisfied by $g(x)$ on some
interval about $-\alpha$. The numerical illustration of Theorem 2.2 will be left
until the Problems section.
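
As a small illustration of the enclosure property, the paired iterations
(2.2.1) and (2.2.8) can be sketched in Python as follows. The function name
`newton_fourier`, the stopping rule x_n - z_n ≤ tol, and the `max_iter` limit
are assumptions of this sketch. The test function x^6 - x - 1 on [1, 2] is the
running example of the chapter; it satisfies f(1) < 0, f(2) > 0, f'(x) > 0, and
f''(x) > 0 on that interval.

```python
def newton_fourier(f, df, a, b, tol, max_iter=50):
    """Return (z, x) with z <= root <= x (up to rounding) and x - z <= tol.

    Assumes f(a) < 0 < f(b) and f' > 0, f'' > 0 on [a, b], as in (2.2.7).
    """
    x, z = b, a                      # x_0 = b, z_0 = a
    for _ in range(max_iter):
        if x - z <= tol:
            return z, x
        slope = df(x)                # both updates use f'(x_n)
        x_next = x - f(x) / slope    # Newton step (2.2.1)
        z_next = z - f(z) / slope    # companion step (2.2.8)
        x, z = x_next, z_next
    raise RuntimeError("bracket did not shrink below tol")


f = lambda x: x**6 - x - 1
df = lambda x: 6 * x**5 - 1
lower, upper = newton_fourier(f, df, 1.0, 2.0, tol=1e-10)
print(lower, upper)
```

The returned pair brackets the root, so upper - lower serves as a computable
error bound for either endpoint.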


Error estimation The preceding procedure gives upper and lower bounds for
the root, with the distance $x_n - z_n$ decreasing quadratically. However, in most
applications, Newton’s method is used alone, without (2.2.8). In that case, we use
the following.

Using the mean value theorem,

$$ f(x_n) = f(x_n) - f(\alpha) = f'(\xi_n)(x_n - \alpha) $$

$$ \alpha - x_n = \frac{-f(x_n)}{f'(\xi_n)} $$

with $\xi_n$ between $x_n$ and $\alpha$. If $f'(x)$ is not changing rapidly
between $x_n$ and $\alpha$, then we have $f'(\xi_n) \approx f'(x_n)$, and

$$ \alpha - x_n \approx \frac{-f(x_n)}{f'(x_n)} = x_{n+1} - x_n $$

with the last equality following from the definition of Newton’s method. For
Newton’s method, the standard error estimate is

$$ \alpha - x_n \approx x_{n+1} - x_n \tag{2.2.13} $$

and this is illustrated in Table 2.2. For relative error, use

$$ \frac{\alpha - x_n}{\alpha} \approx \frac{x_{n+1} - x_n}{x_{n+1}} $$


The Newton algorithm Using the Newton formula (2.2.1) and the error estimate
(2.2.13), we give the following algorithm.

Algorithm Newton (f, df, x_0, ε, root, itmax, ier)

1. Remark: df is the derivative function f'(x), itmax is the
maximum number of iterates to be computed, and ier is an error
flag to the user.

2. itnum := 1

3. denom := df(x_0).

4. If denom = 0, then ier := 2 and exit.

5. x_1 := x_0 - f(x_0)/denom

6. If |x_1 - x_0| ≤ ε, then set ier := 0, root := x_1, and exit.

7. If itnum = itmax, set ier := 1 and exit.

8. Otherwise, itnum := itnum + 1, x_0 := x_1, and go to step 3.


As with the earlier algorithm Bisect, no account is taken of the limits of the 
computer arithmetic, although a practical program would need to do such. Also, 
Newton’s method is not guaranteed to converge, and thus a test on the number of 
iterates (step 7) is necessary. 
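
For readers who want to experiment, the algorithm Newton can be sketched in
Python as below. Returning the pair (root, ier) instead of using output
arguments, and returning the last iterate when ier is nonzero, are choices made
for this sketch only.

```python
def newton(f, df, x0, eps, itmax):
    """Newton's method; returns (root, ier).

    ier = 0: |x1 - x0| <= eps was reached
    ier = 1: itmax iterates were computed without meeting the test
    ier = 2: the derivative vanished at an iterate
    """
    for _ in range(itmax):
        denom = df(x0)                    # step 3
        if denom == 0:                    # step 4
            return x0, 2
        x1 = x0 - f(x0) / denom           # step 5
        if abs(x1 - x0) <= eps:           # step 6, error estimate (2.2.13)
            return x1, 0
        x0 = x1                           # step 8
    return x0, 1                          # step 7


f = lambda x: x**6 - x - 1
df = lambda x: 6 * x**5 - 1
root, ier = newton(f, df, 2.0, eps=1e-10, itmax=20)
print(root, ier)
```

Starting from x_0 = 2.0, as in Table 2.2, the call should report ier = 0 after
a handful of iterates.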

When Newton’s method converges, it generally does so quite rapidly, an 
advantage over the bisection method. But again, it need not converge. Another 
source of difficulty in some cases is the necessity of knowing f’(x) explicitly. 
With some rootfinding problems, this is not possible. The method of the next 
section remedies this situation, at the cost of a somewhat slower speed of 
convergence. 


2.3 The Secant Method


As with Newton’s method, the graph of $y = f(x)$ is approximated by a straight
line in the vicinity of the root $\alpha$. In this case, assume that $x_0$ and
$x_1$ are two initial estimates of the root $\alpha$. Approximate the graph of
$y = f(x)$ by the secant line determined by $(x_0, f(x_0))$ and $(x_1, f(x_1))$.
Let its root be denoted by $x_2$; we hope it will be an improved approximation of
$\alpha$. This is illustrated in Figure 2.4.

Figure 2.4 The secant method.

Using the slope formula with the secant line, we have

$$ \frac{f(x_1) - f(x_0)}{x_1 - x_0} = \frac{f(x_1) - 0}{x_1 - x_2} $$

Solving for $x_2$,

$$ x_2 = x_1 - f(x_1) \cdot \frac{x_1 - x_0}{f(x_1) - f(x_0)} $$

Using $x_1$ and $x_2$, repeat this process to obtain $x_3$, etc. The general
formula based on this is

$$ x_{n+1} = x_n - f(x_n) \cdot \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})} \qquad n \ge 1 \tag{2.3.1} $$

This is the secant method. As with Newton’s method, it is not guaranteed to
converge, but when it does converge, the speed is usually greater than that of the
bisection method.
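
A minimal Python sketch of the iteration (2.3.1) is given next. The function
name `secant`, the error flags (borrowed from the algorithm Newton), and the
stopping test on successive differences are assumptions of this sketch. Note
that only one new value of f is computed per iterate, since the previous value
is retained.

```python
def secant(f, x0, x1, eps, itmax):
    """Secant method (2.3.1); returns (root, ier).

    ier = 0: |x_{n+1} - x_n| <= eps was reached
    ier = 1: itmax iterates were computed without meeting the test
    ier = 2: f(x_n) = f(x_{n-1}), so the secant line has no root
    """
    f0, f1 = f(x0), f(x1)
    for _ in range(itmax):
        if f1 == f0:
            return x1, 2
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) <= eps:
            return x2, 0
        x0, f0 = x1, f1          # keep the old function value
        x1, f1 = x2, f(x2)       # one new evaluation of f per iterate
    return x1, 1


f = lambda x: x**6 - x - 1
root, ier = secant(f, 2.0, 1.0, eps=1e-12, itmax=30)
print(root, ier)
```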


Example Consider again finding the largest root of

$$ f(x) \equiv x^6 - x - 1 = 0 $$

The secant method (2.3.1) was used, and the iteration continued until the
successive differences $x_n - x_{n-1}$ were considered sufficiently small. The
numerical results are given in Table 2.3. The calculations were done on a binary
machine with approximately 16 decimal digits of accuracy, and the table results
are rounded from the computer results.

The convergence is increasingly rapid as n increases. One way of measuring
this is to calculate the ratios

$$ \frac{\alpha - x_{n+1}}{\alpha - x_n} \qquad n \ge 0 $$

For a linear method these are generally constant as $x_n$ converges to $\alpha$.
But in this example, these ratios become smaller with increasing n. One intuitive
explanation is that the straight line connecting $(x_{n-1}, f(x_{n-1}))$ and
$(x_n, f(x_n))$ becomes an increasingly accurate approximation to the graph of
$y = f(x)$ in the vicinity of $x = \alpha$, and consequently the root $x_{n+1}$
of the straight line is an increasingly improved estimate of $\alpha$. Also note
that the iterates $x_n$ move above and below the root $\alpha$ in an apparently
random fashion as n increases. An explanation of this will come from the error
formula (2.3.3) given below.

Table 2.3 Example of secant method

 n   x_n           f(x_n)       α - x_n     x_{n+1} - x_n
 0   2.0           61.0         -8.65E-1
 1   1.0           -1.0          1.35E-1     1.61E-2
 2   1.016129032   -9.154E-1     1.19E-1     1.74E-1
 3   1.190577769    6.575E-1    -5.59E-2    -7.29E-2
 4   1.117655831   -1.685E-1     1.71E-2     1.49E-2
 5   1.132531550   -2.244E-2     2.19E-3     2.29E-3
 6   1.134816808    9.536E-4    -9.27E-5    -9.32E-5
 7   1.134723646   -5.066E-6     4.92E-7     4.92E-7
 8   1.134724138   -1.135E-9     1.10E-10    1.10E-10


Error analysis Multiply both sides of (2.3.1) by -1 and then add $\alpha$ to
both sides, obtaining

$$ \alpha - x_{n+1} = \alpha - x_n + f(x_n) \cdot \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})} $$

The right-hand side can be manipulated algebraically to obtain the formula

$$ \alpha - x_{n+1} = -(\alpha - x_{n-1})(\alpha - x_n) \frac{f[x_{n-1}, x_n, \alpha]}{f[x_{n-1}, x_n]} \tag{2.3.2} $$

The quantities $f[x_{n-1}, x_n]$ and $f[x_{n-1}, x_n, \alpha]$ are first- and
second-order Newton divided differences, defined in (1.1.13) of Chapter 1. The
reader should check (2.3.2) by substituting from (1.1.13) and then simplifying.
Using (1.1.14), formula (2.3.2) becomes

$$ \alpha - x_{n+1} = -(\alpha - x_{n-1})(\alpha - x_n) \frac{f''(\zeta_n)}{2 f'(\xi_n)} \tag{2.3.3} $$

with $\xi_n$ between $x_{n-1}$ and $x_n$, and $\zeta_n$ between $x_{n-1}$, $x_n$,
and $\alpha$. Using this error formula, we can examine the convergence of the
secant method.


Theorem 2.3 Assume $f(x)$, $f'(x)$, and $f''(x)$ are continuous for all values of
x in some interval containing $\alpha$, and assume $f'(\alpha) \ne 0$. Then if
the initial guesses $x_0$ and $x_1$ are chosen sufficiently close to $\alpha$,
the iterates $x_n$ of (2.3.1) will converge to $\alpha$. The order of convergence
will be $p = (1 + \sqrt{5})/2 \approx 1.62$.




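Proof For the neighborhood $I = [\alpha - \varepsilon, \alpha + \varepsilon]$ with
some $\varepsilon > 0$, $f'(x) \ne 0$ everywhere on I. Then define

$$ M = \frac{\max_{x \in I} |f''(x)|}{2 \min_{x \in I} |f'(x)|} $$

Then for all $x_0, x_1 \in [\alpha - \varepsilon, \alpha + \varepsilon]$, using
(2.3.3) and writing $e_n = \alpha - x_n$,

$$ |e_2| \le |e_1| \cdot |e_0| M, \qquad M|e_2| \le M|e_1| \cdot M|e_0| $$

Further assume that $x_1$ and $x_0$ are so chosen that

$$ \delta = \max\{ M|e_0|, M|e_1| \} < 1 \tag{2.3.4} $$

Then $M|e_2| < 1$ since

$$ M|e_2| \le \delta^2 $$

Also $M|e_2| \le \delta^2 < \delta$ implies

$$ |e_2| < \frac{\delta}{M} = \max\{ |e_1|, |e_0| \} \le \varepsilon $$

and thus $x_2 \in [\alpha - \varepsilon, \alpha + \varepsilon]$. We apply this
argument inductively to show that $x_n \in [\alpha - \varepsilon, \alpha + \varepsilon]$
and $M|e_n| < \delta$ for $n \ge 2$.

To prove convergence and obtain the order of convergence, continue
applying (2.3.3) to get

$$ M|e_3| \le M|e_2| \cdot M|e_1| \le \delta^2 \cdot \delta = \delta^3 $$

$$ M|e_4| \le M|e_3| \cdot M|e_2| \le \delta^5 $$

For

$$ M|e_n| \le \delta^{q_n} \tag{2.3.5} $$

we have

$$ M|e_{n+1}| \le M|e_n| \cdot M|e_{n-1}| \le \delta^{q_n + q_{n-1}} = \delta^{q_{n+1}} $$

Thus

$$ q_{n+1} = q_n + q_{n-1} \qquad n \ge 1 \tag{2.3.6} $$

with $q_0 = q_1 = 1$. This is a Fibonacci sequence of numbers, and an explicit
formula can be given:

$$ q_n = \frac{1}{\sqrt{5}} \left[ r_0^{n+1} - r_1^{n+1} \right] \qquad n \ge 0 \tag{2.3.7} $$

$$ r_0 = \frac{1 + \sqrt{5}}{2} \approx 1.618 \qquad r_1 = \frac{1 - \sqrt{5}}{2} \approx -.618 $$

Thus

$$ q_n \approx \frac{1}{\sqrt{5}} (1.618)^{n+1} \qquad \text{for large } n \tag{2.3.8} $$

For example $q_5 = 8$, and formula (2.3.8) gives 8.025. Returning to (2.3.5), we
obtain the error bound

$$ |e_n| \le \frac{1}{M} \delta^{q_n} \qquad n \ge 0 \tag{2.3.9} $$

with $q_n$ given by (2.3.7). Since $q_n \to \infty$ as $n \to \infty$, we have
$x_n \to \alpha$.

By doing a more careful derivation, we can actually show that the order of
convergence is $p = (1 + \sqrt{5})/2$. To simplify the presentation, we instead
show that this is the rate at which the bound in (2.3.9) decreases. Let $B_n$
denote the upper bound in (2.3.9). Then

$$ \frac{B_{n+1}}{B_n^{r_0}} = \frac{\frac{1}{M} \delta^{q_{n+1}}}{\left[ \frac{1}{M} \delta^{q_n} \right]^{r_0}} = M^{r_0 - 1} \delta^{q_{n+1} - r_0 q_n} \le M^{r_0 - 1} \delta^{-1} \equiv c $$

because $q_{n+1} - r_0 q_n = r_1^{n+1} > -1$. Thus

$$ B_{n+1} \le c B_n^{r_0} $$

which implies an order of convergence $p = r_0 = (1 + \sqrt{5})/2$. A similar
result holds for the actual errors $e_n$; moreover,

$$ \lim_{n \to \infty} \frac{|e_{n+1}|}{|e_n|^{r_0}} = \left| \frac{f''(\alpha)}{2 f'(\alpha)} \right|^{(\sqrt{5} - 1)/2} \tag{2.3.10} $$

$\blacksquare$

The error formula (2.3.3) can be used to explain the oscillating behavior of the
iterates $x_n$ about the root $\alpha$ in the last example. For $x_n$ and
$x_{n-1}$ near to $\alpha$, (2.3.3) implies

$$ \alpha - x_{n+1} \approx -(\alpha - x_n)(\alpha - x_{n-1}) \frac{f''(\alpha)}{2 f'(\alpha)} \tag{2.3.11} $$

The sign of $\alpha - x_{n+1}$ is determined from that of the previous two errors,
together with the sign of $f''(\alpha)/f'(\alpha)$.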

The condition (2.3.4) gives some information on how close the initial values
$x_0$ and $x_1$ should be to $\alpha$ in order to have convergence. If the
quantity M is large, or more specifically, if

$$ \left| \frac{f''(\alpha)}{2 f'(\alpha)} \right| $$

is very large, then $\alpha - x_0$ and $\alpha - x_1$ must be correspondingly
smaller. Convergence can occur without (2.3.4), but it is likely to initially be
quite haphazard in such a case.

For an error test, use the same error estimate (2.2.13) that was used with
Newton’s method, namely

$$ \alpha - x_n \approx x_{n+1} - x_n $$

Its use is illustrated in Table 2.3 in the last example. Because the secant method
may not converge, programs implementing it should have an upper limit on the
number of iterates, as with the algorithm Newton in the last section.

A possible problem with the secant method is the calculation of the approximate
derivative

$$ a_n = \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}} \tag{2.3.12} $$

where the secant method (2.3.1) is then written

$$ x_{n+1} = x_n - \frac{f(x_n)}{a_n} \tag{2.3.13} $$

The calculation of $a_n$ involves loss of significance errors, in both the
numerator and denominator. Thus it is a less accurate approximation of the
derivative of f as $x_n \to \alpha$. Nonetheless, we continue to obtain
improvements in the accuracy of $x_n$, until we approach the noise level of
$f(x)$ for x near $\alpha$. At that point, $a_n$ may become very different from
$f'(\alpha)$, and $x_{n+1}$ can jump rapidly away from the root. For this reason,
Dennis and Schnabel (1983, pp. 31-32) recommend the use of (2.3.12) until
$x_n - x_{n-1}$ becomes sufficiently small. They then recommend another
approximation of $f'(x)$:

$$ f'(x_n) \approx a_n = \frac{f(x_n + h) - f(x_n)}{h} $$

with a fixed h. For h, they recommend

$$ h = \sqrt{\delta} \cdot T_\alpha $$

where $T_\alpha$ is a reasonable nonzero approximation to $\alpha$, say $x_n$,
and $\delta$ is the computer’s unit round [see (1.2.12)]. They recommend the use
of h when $|x_n - x_{n-1}|$ is smaller than h. The cost of the secant method in
function evaluations will rise slightly, but probably by not more than one or two.
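
The switch between the two slope approximations can be sketched in Python as
follows. This is only an illustration under stated assumptions: the fixed-step
approximation is taken to be the forward difference (f(x_n + h) - f(x_n))/h,
the machine epsilon reported by `sys.float_info.epsilon` stands in for the unit
round, and the function name `derivative_estimate` and its `typical` argument
(playing the role of the nonzero approximation to the root) are hypothetical.

```python
import math
import sys


def derivative_estimate(f, x_prev, f_prev, x, fx, typical=None):
    """Estimate f'(x) for use as a_n in (2.3.13).

    Uses the secant slope (2.3.12) while the iterates are at least h apart,
    and a fixed-step forward difference once they are closer than h.
    """
    t = x if typical is None else typical
    if t == 0.0:
        raise ValueError("the scale used for h must be nonzero")
    # Assumption: machine epsilon is used in place of the unit round.
    h = math.sqrt(sys.float_info.epsilon) * abs(t)
    if abs(x - x_prev) >= h:
        return (fx - f_prev) / (x - x_prev)   # secant slope, no new evaluation
    return (f(x + h) - fx) / h                # costs one extra evaluation of f


f = lambda x: x**6 - x - 1
far = derivative_estimate(f, 1.0, f(1.0), 1.2, f(1.2))
x = 1.13472413840152
near = derivative_estimate(f, x - 1e-12, f(x - 1e-12), x, f(x))
print(far, near)
```

In the second call the two iterates differ by far less than h, so the forward
difference is used and the estimate stays close to the true derivative.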

The secant method is well recommended as an efficient and easy-to-use 
rootfinding procedure for a wide variety of problems. It also has the advantage of 
not requiring a knowledge of f’(x), unlike Newton’s method. In Section 2.8, the 
secant method will form an important part of another rootfinding algorithm that 
is guaranteed to converge. 


Comparison of Newton’s method and the secant method Newton’s method and
the secant method are closely related. If the approximation

$$ f'(x_n) \approx \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}} $$

is used in the Newton formula (2.2.1), we obtain the secant formula (2.3.1). The
conditions for convergence are almost the same [for example, see (2.2.6) and
(2.3.4) for conditions on the initial error], and the error formulas are similar [see
(2.2.2) and (2.3.3)]. Nonetheless, there are two major differences. Newton’s
method requires two function evaluations per iterate, that of $f(x_n)$ and
$f'(x_n)$, whereas the secant method requires only one function evaluation per
iterate, that of $f(x_n)$ [provided the needed function value $f(x_{n-1})$ is
retained from the last iteration]. Newton’s method is generally more expensive
per iteration. On the other hand, Newton’s method converges more rapidly [order
p = 2 vs. the secant method’s p = 1.62], and consequently it will require fewer
iterations to attain a given desired accuracy. An analysis of the effect of these
two differences in the secant and Newton methods is given below.
We now consider the expenditure of time necessary to reach a desired root
$\alpha$ within a desired tolerance of ε. To simplify the analysis, we assume
that the initial guesses are quite close to the desired root. Define

$$ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \qquad n \ge 0 $$

$$ \hat{x}_{n+1} = \hat{x}_n - f(\hat{x}_n) \cdot \frac{\hat{x}_n - \hat{x}_{n-1}}{f(\hat{x}_n) - f(\hat{x}_{n-1})} \qquad n \ge 1 $$

and let $\hat{x}_0 = x_0$. We define $\hat{x}_1$ based on the following
convergence formula. From (2.2.3) and (2.3.10), respectively,

$$ |\alpha - x_{n+1}| \approx c\,|\alpha - x_n|^2 \qquad n \ge 0, \qquad c = \left| \frac{f''(\alpha)}{2 f'(\alpha)} \right| $$

$$ |\alpha - \hat{x}_{n+1}| \approx d\,|\alpha - \hat{x}_n|^r \qquad r = \frac{1 + \sqrt{5}}{2}, \qquad d = c^{r-1} $$

Inductively for the error in the Newton iterates,

$$ c\,|\alpha - x_{n+1}| \approx (c\,|\alpha - x_n|)^2 $$

$$ c\,|\alpha - x_n| \approx (c\,|\alpha - x_0|)^{2^n} $$

$$ |\alpha - x_n| \approx \frac{1}{c} (c\,|\alpha - x_0|)^{2^n} \qquad n \ge 0 $$

Similarly for the secant method iterates,

$$ |\alpha - \hat{x}_n| \approx d\,|\alpha - \hat{x}_{n-1}|^r \approx d^{1 + r + \cdots + r^{n-1}} |\alpha - x_0|^{r^n} $$

Using the formula (1.1.9) for a finite geometric series, we obtain

$$ d^{1 + r + \cdots + r^{n-1}} = d^{(r^n - 1)/(r - 1)} = c^{r^n - 1} $$

and thus

$$ |\alpha - \hat{x}_n| \approx c^{r^n - 1} |\alpha - x_0|^{r^n} = \frac{1}{c} [c\,|\alpha - x_0|]^{r^n} $$

To satisfy $|\alpha - x_n| \le \varepsilon$ for the Newton iterates, we must have

$$ (c\,|\alpha - x_0|)^{2^n} \le c \varepsilon $$

$$ n \ge \frac{K}{\log 2} \qquad K = \log \left[ \frac{\log \varepsilon c}{\log c\,|\alpha - x_0|} \right] $$

Let m be the time to evaluate $f(x)$, and let s · m be the time to evaluate
$f'(x)$. Then the minimum time to obtain the desired accuracy with Newton’s
method is

$$ T_N = (m + ms) n = \frac{(1 + s) m K}{\log 2} \tag{2.3.14} $$

For the secant method, a similar calculation shows that
$|\alpha - \hat{x}_n| \le \varepsilon$ if

$$ n \ge \frac{K}{\log r} $$

Thus the minimum time necessary to obtain the desired accuracy is

$$ T_S = mn = \frac{mK}{\log r} \tag{2.3.15} $$

To compare the times for the secant method and Newton’s method, we have

$$ \frac{T_S}{T_N} = \frac{\log 2}{(1 + s) \log r} $$

The secant method is faster than the Newton method if the ratio is less than one,

$$ \frac{T_S}{T_N} < 1 $$

$$ s > \frac{\log 2}{\log r} - 1 \approx .44 \tag{2.3.16} $$

If the time to evaluate $f'(x)$ is more than 44 percent of that necessary to
evaluate $f(x)$, then the secant method is more efficient. In practice, many other
factors will affect the relative costs of the two methods, so that the .44 factor
should be used with caution.
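
The break-even value in (2.3.16) is easy to reproduce numerically. The short
Python sketch below evaluates the ratio of the secant time to the Newton time
for a few relative costs s of evaluating the derivative; the names `R`,
`time_ratio`, and `threshold`, and the sample values of s, are illustrative.

```python
import math

R = (1 + math.sqrt(5)) / 2            # order of the secant method, r


def time_ratio(s):
    """Ratio T_S / T_N when f' costs s times as much as f."""
    return math.log(2) / ((1 + s) * math.log(R))


threshold = math.log(2) / math.log(R) - 1    # right side of (2.3.16)
print(round(threshold, 2))                   # 0.44

for s in (0.0, 0.25, 0.5, 1.0, 2.0):
    print(s, round(time_ratio(s), 3))
```

A ratio below one favors the secant method; the ratio crosses one exactly at
s equal to the threshold.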

The preceding argument is useful in illustrating that the mathematical speed of
convergence is not the complete picture. Total computing time, ease of use of an
algorithm, stability, and other factors also have a bearing on the relative
desirability of one algorithm over another one.


2.4 Muller’s Method


Muller’s method is useful for obtaining both real and complex roots of a
function, and it is reasonably straightforward to implement as a computer
program. We derive it, discuss its convergence, and give some numerical
examples.

Muller’s method is a generalization of the approach that led to the secant
method. Given three points $x_0, x_1, x_2$, a quadratic polynomial is constructed
that passes through the three points $(x_i, f(x_i))$, i = 0, 1, 2; one of the
roots of this polynomial is used as an improved estimate for a root $\alpha$ of
$f(x)$.

The quadratic polynomial is given by

$$ p(x) = f(x_2) + (x - x_2) f[x_2, x_1] + (x - x_2)(x - x_1) f[x_2, x_1, x_0] \tag{2.4.1} $$

The divided differences $f[x_2, x_1]$ and $f[x_2, x_1, x_0]$ were defined in
(1.1.13) of Chapter 1. To check that

$$ p(x_i) = f(x_i) \qquad i = 0, 1, 2 $$

just substitute $x_i$ into (2.4.1) and then reduce the resulting expression using
(1.1.13). There are other formulas for $p(x)$ given in Chapter 3, but the form
shown in (2.4.1) is the most convenient for defining Muller’s method. The
formula (2.4.1) is called Newton’s divided difference form of the interpolating
polynomial, and it is developed in general in Section 3.2 of Chapter 3.

To find the zeros of (2.4.1) we first rewrite it in the more convenient form

$$ y = f(x_2) + w (x - x_2) + f[x_2, x_1, x_0] (x - x_2)^2 $$

$$ w = f[x_2, x_1] + (x_2 - x_1) f[x_2, x_1, x_0] = f[x_2, x_1] + f[x_2, x_0] - f[x_0, x_1] $$

We want to find the smallest value of $x - x_2$ that satisfies the equation
y = 0, thus finding the root of (2.4.1) that is closest to $x_2$. The solution is

$$ x - x_2 = \frac{-w \pm \sqrt{w^2 - 4 f(x_2) f[x_2, x_1, x_0]}}{2 f[x_2, x_1, x_0]} $$

with the sign chosen to make the numerator as small as possible. Because of the
loss-of-significance errors implicit in this formula, we rationalize the numerator
to obtain the new iteration formula

$$ x_3 = x_2 - \frac{2 f(x_2)}{w \pm \sqrt{w^2 - 4 f(x_2) f[x_2, x_1, x_0]}} \tag{2.4.2} $$

with the sign chosen to maximize the magnitude of the denominator.

Repeat (2.4.2) recursively to define a sequence of iterates $\{x_n : n \ge 0\}$.
If they converge to a point $\alpha$, and if $f'(\alpha) \ne 0$, then $\alpha$ is
a root of $f(x)$. To see this, use (1.1.14) of Chapter 1 and (2.4.2) to give

$$ w \to f'(\alpha) \qquad \text{as } n \to \infty $$

$$ \alpha = \alpha - \frac{2 f(\alpha)}{f'(\alpha) \pm \sqrt{[f'(\alpha)]^2 - 2 f(\alpha) f''(\alpha)}} $$

showing that the right-hand fraction must be zero. Since $f'(\alpha) \ne 0$ by
assumption, the method of choosing the sign in the denominator implies that the
denominator is nonzero. Then the numerator must be zero, showing
$f(\alpha) = 0$. The assumption $f'(\alpha) \ne 0$ will say that $\alpha$ is a
simple root. (See Section 2.7 for a discussion of simple and multiple roots.)
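
The iteration (2.4.2) can be sketched in Python with complex arithmetic, so
that real starting values may lead to complex iterates. The function name
`muller`, the stopping test on successive iterates, and the error flags
(patterned on the algorithm Newton) are assumptions of this sketch; it is not
the commercial program referred to later in this section.

```python
import cmath


def muller(f, x0, x1, x2, eps, itmax):
    """Muller's method (2.4.2); returns (root, ier) with ier as in Newton."""
    f0, f1, f2 = f(x0), f(x1), f(x2)
    for _ in range(itmax):
        d21 = (f2 - f1) / (x2 - x1)            # f[x2, x1]
        d10 = (f1 - f0) / (x1 - x0)            # f[x1, x0]
        d210 = (d21 - d10) / (x2 - x0)         # f[x2, x1, x0]
        w = d21 + (x2 - x1) * d210
        disc = cmath.sqrt(w * w - 4.0 * f2 * d210)
        # Choose the sign that maximizes the magnitude of the denominator.
        denom = w + disc if abs(w + disc) >= abs(w - disc) else w - disc
        if denom == 0:
            return x2, 2
        x3 = x2 - 2.0 * f2 / denom
        if abs(x3 - x2) <= eps:
            return x3, 0
        x0, x1, x2 = x1, x2, x3
        f0, f1, f2 = f1, f2, f(x3)             # one new evaluation per iterate
    return x2, 1


g = lambda x: x * x + 1                         # roots are +i and -i
root, ier = muller(g, 0.0, 1.0, 2.0, eps=1e-12, itmax=50)
print(root, ier)
```

The example starts from three real points and still arrives at a complex root
of x^2 + 1, since the interpolating quadratic has a negative discriminant.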

By an argument similar to that used for the secant method, it can be shown
that

$$ \lim_{n \to \infty} \frac{|\alpha - x_{n+1}|}{|\alpha - x_n|^p} = \left| \frac{f^{(3)}(\alpha)}{6 f'(\alpha)} \right|^{(p-1)/2} \qquad p \approx 1.84 \tag{2.4.3} $$

provided $f(x)$ is three times continuously differentiable in a neighborhood of
$\alpha$ and $f'(\alpha) \ne 0$. The order p is the positive root of

$$ x^3 - x^2 - x - 1 = 0 $$

With the secant method, real choices of $x_0$ and $x_1$ lead to a real value of
$x_2$. But with Muller’s method, real choices of $x_0, x_1, x_2$ can and do lead
to complex roots of $f(x)$. This is an important aspect of Muller’s method, being
one reason it is used.

The following examples were computed using a commercial program that gives 
an automatic implementation of Muller’s method. With no initial guesses given, 
it found the roots of f(x) in roughly increasing order. After approximations 
$z_1, \ldots, z_r$ had been found as roots, the function

$$ g(x) = \frac{f(x)}{(x - z_1) \cdots (x - z_r)} \tag{2.4.4} $$


was used in finding the remaining roots of f(x). [For a discussion of the errors in 
this use of g(x), see Peters and Wilkinson (1971)]. In order that an approximate 
root z be acceptable to the program, it had to satisfy one of the following two 
conditions (specified by the user): 


1. $|f(z)| < 10^{-10}$

2. z has eight significant digits of accuracy. 

In Tables 2.4 and 2.5, the roots are given in the order in which they were found. 
The column IT gives the number of iterates that were calculated for each root. 


The examples are all for f(x) a polynomial, but the program was designed for 
general functions f(x), with x allowed to be complex. 


Table 2.4 Muller’s method, example 2 


IT   Root                               f(root)
 9   1.1572211736E-1                     5.96E-8
10   6.1175748452E-1 + 9.01E-20i        -2.98E-7 + 9.06E-11i
14   2.8337513377E0  - 5.05E-17i         2.55E-5 - 4.78E-8i
13   4.5992276394E0  - 5.95E-135i        7.13E-5 + 9.37E-6i
 8   1.5126102698E0  + 2.98E-16i         3.34E-6 - 2.35E-7i
19   1.3006054993E1  + 9.04E-18i         2.32E-1 + 4.15E-7i
16   9.6213168425E0  - 4.97E-17i        -3.66E-2 - 5.38E-7i
14   1.7116855187E1  - 8.48E-17i        -1.68E+0 + 2.40E-5i
13   2.2151090379E1  + 9.35E-18i         8.61E-1 + 2.60E-5i
 7   6.8445254531E0  - 3.43E-28i        -4.49E-3 - 1.22E-18i
 4   2.8487967251E1  + 5.77E-25i        -6.34E+1 - 2.96E-11i
 4   3.7099121044E1  + 2.80E-24i         2.12E+3 + 7.72E-9i




Table 2.5 Muller's method, example 3


IT   Root                           f(root)
41   2.9987526 - 6.98E-4i           -3.33E-11 + 6.70E-11i
17   2.9997591 - 2.68E-4i            5.68E-14 - 6.48E-14i
31   3.0003095 - 3.17E-4i           -3.41E-13 + 3.22E-14i
10   3.0003046 + 3.14E-4i            3.98E-13 - 3.83E-14i
 6   5.91E-15 + 3.000000000i         4.38E-11 - 1.19E-11i
 3   5.91E-15 - 3.000000000i         4.38E-11 + 1.19E-11i


Example 1. $f(x) = x^{20} - 1$. All 20 roots were found with an accuracy of 10
or more significant digits. In all cases, the approximate root z satisfied
$|f(z)| < 10^{-10}$, generally much less. The number of iterates ranged from 1 to 18, with an
average of 8.5.


2. f(x) = Laguerre polynomial of degree 12. The real parts of the roots as 
shown are correct, rounded to the number of places shown, but the imaginary 
parts should all be zero. The numerical results are given in Table 2.4. Note that 
f(x) is quite large for many of the approximate roots. 


3. $$ f(x) = x^6 - 12x^5 + 63x^4 - 216x^3 + 567x^2 - 972x + 729 = (x^2 + 9)(x - 3)^4 $$


The numerical results are given in Table 2.5. Note the inaccuracy in the first four 
roots, which is inherent due to the noise in f(x) associated with $\alpha = 3$ being a
repeated root. See Section 2.7 for a complete discussion of the problems in 
calculating repeated roots. 


The last two examples demonstrate why two error tests are necessary, and they
indicate why the routine requests a maximum on the number of iterations to be
allowed per root. The form (2.4.2) of Muller's method is due to Traub (1964,
pp. 210-213). For a computational discussion, see Whitley (1968).


2.5 A General Theory for One-Point Iteration Methods 
We now consider solving an equation x = g(x) for a root $\alpha$ by the iteration

$$ x_{n+1} = g(x_n) \qquad n \ge 0 \tag{2.5.1} $$

with $x_0$ an initial guess to $\alpha$. The Newton method fits in this pattern with

$$ g(x) = x - \frac{f(x)}{f'(x)} \tag{2.5.2} $$

Table 2.6 Iteration examples for $x^2 - 3 = 0$

     Case (i)   Case (ii)   Case (iii)
n    x_n        x_n         x_n
0    2.0        2.0         2.0
1    3.0        1.5         1.75
2    9.0        2.0         1.732143
3    87.0       1.5         1.732051


Each solution of x = g(x) is called a fixed point of g. Although we are usually 
interested in solving an equation f(x) = 0, there are many ways this can be 
reformulated as a fixed-point problem. At this point, we just illustrate this 
reformulation process with some examples. 


Example Consider solving $x^2 - a = 0$ for a > 0.

(i) $x = x^2 + x - a$, or more generally, $x = x + c(x^2 - a)$ for some $c \ne 0$

(ii) $x = \dfrac{a}{x}$

(iii) $x = \dfrac{1}{2}\left(x + \dfrac{a}{x}\right)$  (2.5.3)

We give a numerical example with a = 3, $x_0 = 2$, and $\alpha = \sqrt{3} \approx 1.732051$. With
$x_0 = 2$, the numerical results for (2.5.1) in these cases are given in Table 2.6.
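
The three reformulations can be tried directly. The following is a small sketch, with the helper name `iterate` chosen here for illustration, that runs iteration (2.5.1) for each choice of g and a = 3, starting from $x_0 = 2$.

```python
def iterate(g, x0, n):
    """Return [x_0, x_1, ..., x_n] for the iteration x_{k+1} = g(x_k)."""
    xs = [x0]
    for _ in range(n):
        xs.append(g(xs[-1]))
    return xs

a = 3.0
case_i = iterate(lambda x: x * x + x - a, 2.0, 3)        # diverges
case_ii = iterate(lambda x: a / x, 2.0, 3)               # oscillates
case_iii = iterate(lambda x: 0.5 * (x + a / x), 2.0, 3)  # converges rapidly

for n in range(4):
    print(n, case_i[n], case_ii[n], round(case_iii[n], 6))
```

The printed columns correspond to the three cases of Table 2.6.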


It is natural to ask what makes the various iterative schemes behave in the way 
they do in this example. We will develop a general theory to explain this behavior 
and to aid in analyzing new iterative methods. 


Lemma 2.4 Let g(x) be continuous on the interval $a \le x \le b$, and assume that
$a \le g(x) \le b$ for every $a \le x \le b$. (We say g sends [a, b] into
[a, b], and denote it by $g([a, b]) \subset [a, b]$.) Then x = g(x) has at
least one solution in [a, b].


Proof Consider the continuous function g(x) - x. At x = a, it is nonnegative, and
at x = b it is nonpositive. Thus by the intermediate value theorem, it must
have a root in the interval [a, b]. In Figure 2.5, the roots are the
intersection points of y = x and y = g(x).


Lemma 2.5 Let g(x) be continuous on [a, b], and assume $g([a, b]) \subset [a, b]$.
Furthermore, assume there is a constant $0 < \lambda < 1$, with

$$ |g(x) - g(y)| \le \lambda |x - y| \qquad \text{for all } x, y \in [a, b] \tag{2.5.4} $$

Then x = g(x) has a unique solution $\alpha$ in [a, b]. Also, the iterates

$$ x_n = g(x_{n-1}) \qquad n \ge 1 $$

will converge to $\alpha$ for any choice of $x_0$ in [a, b], and

$$ |\alpha - x_n| \le \frac{\lambda^n}{1 - \lambda} |x_1 - x_0| \tag{2.5.5} $$


Proof Suppose x = g(x) has two solutions $\alpha$ and $\beta$ in [a, b]. Then

$$ |\alpha - \beta| = |g(\alpha) - g(\beta)| \le \lambda |\alpha - \beta| $$

$$ (1 - \lambda) |\alpha - \beta| \le 0 $$

Since $0 < \lambda < 1$, this implies $\alpha = \beta$. Also we know by the earlier lemma
that there is at least one root $\alpha$ in [a, b].

To examine the convergence of the iterates $x_n$, first note that they all
remain in [a, b]. To see this, note that the result

$$ x_n \in [a, b] \quad \text{implies} \quad x_{n+1} = g(x_n) \in [a, b] $$

can be used with mathematical induction to prove $x_n \in [a, b]$ for all n.
For the convergence,

$$ |\alpha - x_{n+1}| = |g(\alpha) - g(x_n)| \le \lambda |\alpha - x_n| \tag{2.5.6} $$

and by induction,

$$ |\alpha - x_n| \le \lambda^n |\alpha - x_0| \qquad n \ge 0 \tag{2.5.7} $$

As $n \to \infty$, $\lambda^n \to 0$; thus, $x_n \to \alpha$.


Figure 2.5 Example of Lemma 2.4.

To prove the bound (2.5.5), begin with

$$ |\alpha - x_0| \le |\alpha - x_1| + |x_1 - x_0| \le \lambda |\alpha - x_0| + |x_1 - x_0| $$

where the last step used (2.5.6). Then solving for $|\alpha - x_0|$, we have

$$ |\alpha - x_0| \le \frac{1}{1 - \lambda} |x_1 - x_0| \tag{2.5.8} $$

Combining this with (2.5.7) will complete the proof.


The bound (2.5.6) shows that the sequence $\{x_n\}$ is linearly convergent, with
the rate of convergence bounded by $\lambda$, based on the definition (2.0.13). Also from
the proof, we can devise a possibly more accurate error bound than (2.5.5).
Repeating the argument that led to (2.5.8), we obtain

$$ |\alpha - x_n| \le \frac{1}{1 - \lambda} |x_{n+1} - x_n| $$

Further, applying (2.5.6) yields the bound

$$ |\alpha - x_{n+1}| \le \frac{\lambda}{1 - \lambda} |x_{n+1} - x_n| \tag{2.5.9} $$

When $\lambda$ is computable, this furnishes a practical bound in most situations. Other
error bounds and estimates are discussed in the following section.
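
As an illustration of how (2.5.9) can serve as a stopping test, the sketch below iterates until the bound falls under a tolerance. The example g(x) = cos(x) on [0, 1] is an assumption made here, not taken from the text: cos maps [0, 1] into itself, and $\lambda = \sin(1) < 1$ bounds $|g'(x)|$ on that interval. The helper name is hypothetical.

```python
import math

def fixed_point_with_bound(g, x0, lam, tol=1e-10, max_iter=1000):
    """Iterate x_{n+1} = g(x_n); stop when the bound (2.5.9) is at most tol."""
    x = x0
    bound = float("inf")
    for _ in range(max_iter):
        x_new = g(x)
        bound = lam / (1.0 - lam) * abs(x_new - x)   # right side of (2.5.9)
        x = x_new
        if bound <= tol:
            break
    return x, bound

lam = math.sin(1.0)          # max of |g'(x)| = |sin x| on [0, 1]
root, bound = fixed_point_with_bound(math.cos, 0.5, lam)
print(root, bound)
```

The returned bound is a guaranteed (if sometimes pessimistic) estimate of the error in the final iterate, provided the constant passed as `lam` really does bound $|g'|$ on the interval.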
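If g(x) is differentiable on [a, b], then

$$ g(x) - g(y) = g'(\xi)(x - y) \qquad \xi \text{ between } x \text{ and } y $$

for all $x, y \in [a, b]$. Define

$$ \lambda = \operatorname*{Max}_{a \le x \le b} |g'(x)| $$

Then

$$ |g(x) - g(y)| \le \lambda |x - y| \qquad \text{all } x, y \in [a, b] $$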


Theorem 2.6 Assume that g(x) is continuously differentiable on [a, b], that
$g([a, b]) \subset [a, b]$, and that

$$ \lambda = \operatorname*{Max}_{a \le x \le b} |g'(x)| < 1 \tag{2.5.10} $$

Then
(i) x = g(x) has a unique solution $\alpha$ in [a, b]
(ii) For any choice of $x_0$ in [a, b], with $x_{n+1} = g(x_n)$, $n \ge 0$,

$$ \operatorname*{Limit}_{n \to \infty} x_n = \alpha $$

(iii)
$$ |\alpha - x_n| \le \lambda^n |\alpha - x_0| \le \frac{\lambda^n}{1 - \lambda} |x_1 - x_0| $$

(iv)
$$ \operatorname*{Limit}_{n \to \infty} \frac{\alpha - x_{n+1}}{\alpha - x_n} = g'(\alpha) \tag{2.5.11} $$

Proof Every result comes from the preceding lemmas, except for the rate of
convergence (2.5.11). For it, use

$$ \alpha - x_{n+1} = g(\alpha) - g(x_n) = g'(\xi_n)(\alpha - x_n) \qquad n \ge 0 \tag{2.5.12} $$

with $\xi_n$ an unknown point between $\alpha$ and $x_n$. Since $x_n \to \alpha$, we must
have $\xi_n \to \alpha$, and thus

$$ \operatorname*{Limit}_{n \to \infty} \frac{\alpha - x_{n+1}}{\alpha - x_n} = \operatorname*{Limit}_{n \to \infty} g'(\xi_n) = g'(\alpha) $$

If $g'(\alpha) \ne 0$, then the sequence $\{x_n\}$ converges to $\alpha$ with order exactly
p = 1, linear convergence.


Figure 2.6 Examples of convergent and nonconvergent sequences
$x_{n+1} = g(x_n)$. Panels: $0 < g'(\alpha) < 1$; $-1 < g'(\alpha) < 0$; $g'(\alpha) < -1$.

This theorem generalizes to systems of m nonlinear equations in m unknowns.
Just regard x as an element of $R^m$, g(x) as a function from $R^m$ to $R^m$, replace the
absolute values by vector and matrix norms, and replace g'(x) by the Jacobian
matrix for g(x). The assumption $g([a, b]) \subset [a, b]$ must be replaced with a
stronger assumption, and care must be exercised in the choice of a region
generalizing [a, b]. The lemmas generalize, but they are nontrivial to prove. This
is discussed further in Section 2.10.

To see the importance of the assumption (2.5.10) on the size of g'(x), suppose
$|g'(\alpha)| > 1$. Then if we had a sequence of iterates $x_{n+1} = g(x_n)$ and a root
$\alpha = g(\alpha)$, we have (2.5.12). If $x_n$ becomes sufficiently close to $\alpha$, then $|g'(\xi_n)| > 1$
and the error $|\alpha - x_{n+1}|$ will be greater than $|\alpha - x_n|$. Thus convergence is not
possible if $|g'(\alpha)| > 1$. We graphically portray the computation of the iterates in
four cases (see Figure 2.6).

To simplify the application of the previous theorem, we give the following
result.


Theorem 2.7 Assume $\alpha$ is a solution of x = g(x), and suppose that g(x) is
continuously differentiable in some neighboring interval about $\alpha$
with $|g'(\alpha)| < 1$. Then the results of Theorem 2.6 are still true,
provided $x_0$ is chosen sufficiently close to $\alpha$.


Proof Pick a number $\lambda$ satisfying $|g'(\alpha)| < \lambda < 1$. Then pick an interval
$I = [\alpha - \epsilon, \alpha + \epsilon]$ with

$$ \operatorname*{Max}_{x \in I} |g'(x)| \le \lambda < 1 $$

We have $g(I) \subset I$, since $|\alpha - x| \le \epsilon$ implies

$$ |\alpha - g(x)| = |g(\alpha) - g(x)| = |g'(\xi)| \cdot |\alpha - x| \le \lambda |\alpha - x| \le \epsilon $$

Now apply the preceding theorem using $[a, b] = [\alpha - \epsilon, \alpha + \epsilon]$.
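Example Referring back to the earlier example in this section, calculate $g'(\alpha)$.

(i) $g(x) = x^2 + x - 3 \qquad g'(\alpha) = g'(\sqrt{3}) = 2\sqrt{3} + 1 > 1$

(ii) $g(x) = \dfrac{3}{x} \qquad g'(\sqrt{3}) = \dfrac{-3}{(\sqrt{3})^2} = -1$

(iii) $g(x) = \dfrac{1}{2}\left(x + \dfrac{3}{x}\right) \qquad g'(x) = \dfrac{1}{2}\left(1 - \dfrac{3}{x^2}\right) \qquad g'(\sqrt{3}) = 0$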


Example For $x = x + c(x^2 - 3)$, pick c to ensure convergence. Since the
solution is $\alpha = \sqrt{3}$, and since $g'(x) = 1 + 2cx$, pick c so that

$$ -1 < 1 + 2c\sqrt{3} < 1 $$

For a good rate of convergence, pick c so that

$$ 1 + 2c\sqrt{3} \approx 0 $$

This gives

$$ c \approx -\frac{1}{2\sqrt{3}} $$

Use $c = -\frac{1}{4}$. Then $g'(\sqrt{3}) = 1 - (\sqrt{3}/2) \approx .134$. This gives the iteration scheme

$$ x_{n+1} = x_n - \frac{1}{4}(x_n^2 - 3) \qquad n \ge 0 \tag{2.5.13} $$

The numerical results are given in Table 2.7. The ratio column gives the values of

$$ \frac{\alpha - x_n}{\alpha - x_{n-1}} $$

The results agree closely with the theoretical value of $g'(\sqrt{3})$.

Table 2.7 Numerical example of iteration (2.5.13)

n   x_n         α - x_n     Ratio
0   2.0         -2.68E-1
1   1.75        -1.79E-2    .0668
2   1.7343750   -2.32E-3    .130
3   1.7323608   -3.10E-4    .134
4   1.7320923   -4.15E-5    .134
5   1.7320564   -5.56E-6    .134
6   1.7320516   -7.45E-7    .134
7   1.7320509   -1.00E-7    .134
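
The ratios of successive errors for iteration (2.5.13) can be recomputed with a short script. This is a sketch only; the variable names are chosen here for illustration. It prints the error and the ratio $(\alpha - x_n)/(\alpha - x_{n-1})$ for n = 1, ..., 7.

```python
import math

alpha = math.sqrt(3.0)
x = 2.0
errors = [alpha - x]
for n in range(7):
    x = x - 0.25 * (x * x - 3.0)      # iteration (2.5.13) with c = -1/4
    errors.append(alpha - x)

ratios = [errors[n] / errors[n - 1] for n in range(1, len(errors))]
for n, r in enumerate(ratios, start=1):
    print(n, f"{errors[n]:.2e}", f"{r:.4f}")
```

The ratios settle near $1 - \sqrt{3}/2$, the value of $g'(\sqrt{3})$ predicted by (2.5.11).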


Higher order one-point methods We complete the development of the theory
for one-point iteration methods by considering methods with an order of convergence
greater than one, for example, Newton's method.


Theorem 2.8 Assume $\alpha$ is a root of x = g(x), and that g(x) is p times
continuously differentiable for all x near to $\alpha$, for some $p \ge 2$.
Furthermore, assume

$$ g'(\alpha) = \cdots = g^{(p-1)}(\alpha) = 0 \tag{2.5.14} $$

Then if the initial guess $x_0$ is chosen sufficiently close to $\alpha$, the
iteration

$$ x_{n+1} = g(x_n) \qquad n \ge 0 $$

will have order of convergence p, and

$$ \operatorname*{Limit}_{n \to \infty} \frac{\alpha - x_{n+1}}{(\alpha - x_n)^p} = (-1)^{p-1} \frac{g^{(p)}(\alpha)}{p!} $$

Proof Expand $g(x_n)$ about $\alpha$:

$$ x_{n+1} = g(x_n) = g(\alpha) + (x_n - \alpha) g'(\alpha) + \cdots + \frac{(x_n - \alpha)^{p-1}}{(p-1)!} g^{(p-1)}(\alpha) + \frac{(x_n - \alpha)^p}{p!} g^{(p)}(\xi_n) $$

for some $\xi_n$ between $x_n$ and $\alpha$. Using (2.5.14) and $\alpha = g(\alpha)$,

$$ \alpha - x_{n+1} = -\frac{(x_n - \alpha)^p}{p!} g^{(p)}(\xi_n) $$

Use Theorem 2.7 and $x_n \to \alpha$ to complete the proof.


The Newton method can be analyzed by this result:

$$ g(x) = x - \frac{f(x)}{f'(x)} \qquad g'(x) = \frac{f(x) f''(x)}{[f'(x)]^2} $$

$$ g'(\alpha) = 0 \qquad g''(\alpha) = \frac{f''(\alpha)}{f'(\alpha)} $$

This and (2.5.14) give the previously obtained convergence result (2.2.3) for
Newton's method. For other examples of the application of Theorem 2.8, see the
problems at the end of the chapter.

The theory of this section is only for one-point iteration methods, thus
eliminating the secant method and Muller's method from consideration. There is
a corresponding fixed-point theory for multistep fixed-point methods, which can
be found in Traub (1964). We omit it here, principally because only the one-point
fixed-point iteration theory will be needed in later chapters.
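
The application of Theorem 2.8 to Newton's method, with p = 2, can be checked numerically. The sketch below uses the test function $f(x) = x^2 - 3$, chosen here only for illustration, and compares the ratios $(\alpha - x_{n+1})/(\alpha - x_n)^2$ with the predicted limit $-g''(\alpha)/2 = -f''(\alpha)/(2 f'(\alpha))$.

```python
import math

alpha = math.sqrt(3.0)
x = 2.0
ratios = []
for n in range(3):
    x_new = x - (x * x - 3.0) / (2.0 * x)      # Newton step for f(x) = x^2 - 3
    ratios.append((alpha - x_new) / (alpha - x) ** 2)
    x = x_new

# Theorem 2.8 with p = 2: limit is (-1)^(p-1) g''(alpha) / 2!
limit = -2.0 / (2.0 * (2.0 * alpha))           # -f''(alpha) / (2 f'(alpha))
print(ratios, limit)
```

Only three steps are taken, since after that the error reaches the level of rounding noise and the ratio is no longer meaningful.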


2.6 Aitken Extrapolation for 
Linearly Convergent Sequences 


From (2.5.11) of Theorem 2.6,

$$ \operatorname*{Limit}_{n \to \infty} \frac{\alpha - x_{n+1}}{\alpha - x_n} = g'(\alpha) \tag{2.6.1} $$

for a convergent iteration

$$ x_{n+1} = g(x_n) \qquad n \ge 0 $$

In this section, we concern ourselves only with the case of linear convergence.
Thus we will assume

$$ 0 < |g'(\alpha)| < 1 \tag{2.6.2} $$


We examine estimating the error in the iterates and give a way to accelerate the 
convergence of { x, }. 
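We begin by considering the ratios

$$ \lambda_n = \frac{x_n - x_{n-1}}{x_{n-1} - x_{n-2}} \qquad n \ge 2 \tag{2.6.3} $$

Claim:

$$ \operatorname*{Limit}_{n \to \infty} \lambda_n = g'(\alpha) \tag{2.6.4} $$

To see this, write

$$ \lambda_n = \frac{(\alpha - x_{n-1}) - (\alpha - x_n)}{(\alpha - x_{n-2}) - (\alpha - x_{n-1})} $$

Using (2.5.12),

$$ \lambda_n = \frac{(\alpha - x_{n-1}) - g'(\xi_{n-1})(\alpha - x_{n-1})}{(\alpha - x_{n-1})/[g'(\xi_{n-2})] - (\alpha - x_{n-1})} = \frac{1 - g'(\xi_{n-1})}{1/[g'(\xi_{n-2})] - 1} $$

$$ \operatorname*{Limit}_{n \to \infty} \lambda_n = \frac{1 - g'(\alpha)}{1/[g'(\alpha)] - 1} = g'(\alpha) $$

The quantity $\lambda_n$ is computable, and when it converges empirically to a value $\lambda$,
we assume $\lambda \approx g'(\alpha)$.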
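We use $\lambda_n \approx g'(\alpha)$ to estimate the error in the iterates $x_n$. Assume

$$ \alpha - x_n \approx \lambda_n (\alpha - x_{n-1}) $$

Then

$$ \alpha - x_n = (\alpha - x_{n-1}) + (x_{n-1} - x_n) \approx \frac{1}{\lambda_n}(\alpha - x_n) + (x_{n-1} - x_n) $$

$$ \alpha - x_n \approx \frac{\lambda_n}{1 - \lambda_n} (x_n - x_{n-1}) \tag{2.6.5} $$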


This is Aitken's error formula for $x_n$, and it is increasingly accurate as $\{\lambda_n\}$
converges to $g'(\alpha)$.

Table 2.8 Iteration (2.6.6)

                                                    Estimate
n   x_n         x_n - x_{n-1}   λ_n     α - x_n     (2.6.5)
0   6.0000000                           1.55E-2
1   6.0005845   5.845E-4                1.49E-2
2   6.0011458   5.613E-4        .9603   1.44E-2     1.36E-2
3   6.0016848   5.390E-4        .9604   1.38E-2     1.31E-2
4   6.0022026   5.178E-4        .9606   1.33E-2     1.26E-2
5   6.0027001   4.974E-4        .9607   1.28E-2     1.22E-2
6   6.0031780   4.780E-4        .9609   1.23E-2     1.17E-2
7   6.0036374   4.593E-4        .9610   1.18E-2     1.13E-2

Example Consider the iteration

$$ x_{n+1} = 6.28 + \sin(x_n) \qquad n \ge 0 \tag{2.6.6} $$

The true root is $\alpha \approx 6.01550307297$. The results of the iteration are given in
Table 2.8, along with the values of $\lambda_n$, $\alpha - x_n$, $x_n - x_{n-1}$, and the error estimate
(2.6.5). The values of $\lambda_n$ are converging to

$$ g'(\alpha) = \cos(\alpha) \approx .9644 $$

and the estimate (2.6.5) is an accurate indicator of the true error. The size of
$g'(\alpha)$ also shows that the iterates will converge very slowly, and in this case,
$x_{n+1} - x_n$ is not an accurate indicator of $\alpha - x_n$.


Aitken's extrapolation formula is simply (2.6.5), rewritten as an estimate of $\alpha$:

$$ \alpha \approx x_n + \frac{\lambda_n}{1 - \lambda_n} (x_n - x_{n-1}) \tag{2.6.7} $$

We denote this right side by $\hat{x}_n$, for $n \ge 2$. By substituting (2.6.3) into (2.6.7), the
formula for $\hat{x}_n$ can be rewritten as

$$ \hat{x}_n = x_n - \frac{(x_n - x_{n-1})^2}{(x_n - x_{n-1}) - (x_{n-1} - x_{n-2})} \qquad n \ge 2 \tag{2.6.8} $$

which is the formula given in many texts.


Example Use the results in Table 2.8 for iteration (2.6.6). With n = 7, using
either (2.6.7) or (2.6.8),

$$ \hat{x}_7 = 6.0149518 \qquad \alpha - \hat{x}_7 = 5.51E-4 $$

Thus the extrapolate $\hat{x}_7$ is significantly more accurate than $x_7$.
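
The quantities in Table 2.8 and the extrapolate above can be recomputed with a few lines. This is a sketch; the helper names `lam`, `error_estimate`, and `extrapolate` are chosen here for illustration and implement (2.6.3), (2.6.5), and (2.6.7) respectively.

```python
import math

def g(x):
    return 6.28 + math.sin(x)          # iteration (2.6.6)

xs = [6.0]
for _ in range(7):
    xs.append(g(xs[-1]))

def lam(n):
    """Ratio of successive differences, (2.6.3); requires n >= 2."""
    return (xs[n] - xs[n - 1]) / (xs[n - 1] - xs[n - 2])

def error_estimate(n):
    """Aitken's estimate of alpha - x_n, (2.6.5)."""
    return lam(n) / (1.0 - lam(n)) * (xs[n] - xs[n - 1])

def extrapolate(n):
    """Aitken extrapolate, (2.6.7)."""
    return xs[n] + error_estimate(n)

print(lam(7), error_estimate(7), extrapolate(7))
```

Since $\lambda_n$ is close to 1 here, the factor $\lambda_n/(1 - \lambda_n)$ is large, which is why the raw difference $x_n - x_{n-1}$ badly underestimates the error.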




We now combine linear iteration and Aitken extrapolation in a simpleminded 
algorithm. 


Algorithm Aitken (g, $x_0$, $\epsilon$, root):


1. Remark: It is assumed that $|g'(\alpha)| < 1$ and that ordinary linear
iteration using $x_0$ will converge to $\alpha$.


2. $x_1 := g(x_0)$, $x_2 := g(x_1)$.


3. $\hat{x}_2 := x_2 - \dfrac{(x_2 - x_1)^2}{(x_2 - x_1) - (x_1 - x_0)}$


4. If $|\hat{x}_2 - x_2| \le \epsilon$, then root := $\hat{x}_2$ and exit.
5. Set $x_0 := \hat{x}_2$ and go to step 2.


This algorithm will usually converge, provided the assumptions of step 1 are 
satisfied. 
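
Algorithm Aitken translates almost line for line into code. The version below is a sketch: the cycle limit `max_cycles` and the guard against a zero denominator are additions made here for safety and are not part of the algorithm as stated.

```python
import math

def aitken(g, x0, eps, max_cycles=100):
    """Linear iteration combined with Aitken extrapolation (steps 2 to 5)."""
    for _ in range(max_cycles):
        x1 = g(x0)
        x2 = g(x1)
        denom = (x2 - x1) - (x1 - x0)
        if denom == 0.0:
            return x2                  # added guard: differences have stagnated
        x_hat = x2 - (x2 - x1) ** 2 / denom
        if abs(x_hat - x2) <= eps:
            return x_hat
        x0 = x_hat
    raise RuntimeError("no convergence within max_cycles")

root = aitken(lambda x: 6.28 + math.sin(x), 6.0, 1e-8)
print(root)
```

Each cycle costs two evaluations of g, so the rapid convergence comes at a modest price compared with plain linear iteration.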


Example To illustrate algorithm Aitken, we repeat the previous example (2.6.6).
The numerical results are given in Table 2.9. The values $x_3$, $x_6$, and $x_9$ are the
Aitken extrapolates defined by (2.6.7). The values of $\lambda_n$ are given for only the
cases n = 2, 5, and 8, since only then do the errors $\alpha - x_n$, $\alpha - x_{n-1}$, and
$\alpha - x_{n-2}$ decrease linearly, as is needed for $\lambda_n \approx g'(\alpha)$.


Extrapolation is often used with slowly convergent linear iteration methods for 
solving large systems of simultaneous linear equations. The actual methods used 
are different from that previously described, but they also are based on the 
general idea of finding the qualitative behavior of the error, as in (2.6.1), and of 
then using that to produce an improved estimate of the answer. This idea is also 
pursued in developing numerical methods for integration, solving differential 
equations, and other mathematical problems. 


Table 2.9 Algorithm Aitken applied to (2.6.6) 


n   x_n              λ_n      α - x_n
0   6.000000000000            1.55E-2
1   6.000584501801            1.49E-2
2   6.001145770761   .96025   1.44E-2
3   6.014705147543            7.98E-4
4   6.014733648720            7.69E-4
5   6.014761128955   .96418   7.42E-4
6   6.015500802060            2.27E-6
7   6.015500882935            2.19E-6
8   6.015500960931   .96439   2.11E-6
9   6.015503072947            2.05E-11


2.7 The Numerical Evaluation of Multiple Roots


We say that the function f(x) has a root $\alpha$ of multiplicity p > 1 if

$$ f(x) = (x - \alpha)^p h(x) \tag{2.7.1} $$

with $h(\alpha) \ne 0$ and h(x) continuous at $x = \alpha$. We restrict p to be a positive
integer, although some of the following is equally valid for nonintegral values. If
h(x) is sufficiently differentiable at $x = \alpha$, then (2.7.1) is equivalent to

$$ f(\alpha) = f'(\alpha) = \cdots = f^{(p-1)}(\alpha) = 0 \qquad f^{(p)}(\alpha) \ne 0 \tag{2.7.2} $$


When finding a root of any function on a computer, there is always an interval
of uncertainty about the root, and this is made worse when the root is multiple.
To see this more clearly, consider evaluating the two functions $f_1(x) = x^2 - 3$
and $f_2(x) = 9 + x^2(x^2 - 6)$. Then $\alpha = \sqrt{3}$ has multiplicity one as a root of $f_1$
and multiplicity two as a root of $f_2$. Using four-digit decimal arithmetic,
$f_1(x) < 0$ for $x \le 1.731$, $f_1(1.732) = 0$, and $f_1(x) > 0$ for $x \ge 1.733$. But $f_2(x) = 0$
for $1.726 \le x \le 1.738$, thus limiting the amount of accuracy that can be
attained in finding a root of $f_2(x)$. A second example of the effect of noise in the
evaluation of a multiple root is illustrated for $f(x) = (x - 1)^3$ in Figures 1.1 and
1.2 of Section 1.3 of Chapter 1. As a final illustration, consider the following
example.


Example Evaluate

$$ f(x) = (x - 1.1)^3 (x - 2.1) = x^4 - 5.4x^3 + 10.56x^2 - 8.954x + 2.7951 \tag{2.7.3} $$

on an IBM PC microcomputer using double precision arithmetic (in BASIC). The
coefficients will not enter exactly because they do not have finite binary expansions
(except for the $x^4$ term). The polynomial f(x) was evaluated in its
expanded form (2.7.3) and also using the nested multiplication scheme

$$ f(x) = 2.7951 + x(-8.954 + x(10.56 + x(-5.4 + x))) \tag{2.7.4} $$


Table 2.10 Evaluation of $f(x) = (x - 1.1)^3 (x - 2.1)$


x          f(x): (2.7.3)   f(x): (2.7.4)
1.099992    3.86E-16        5.55E-16
1.099994    3.86E-16        2.76E-16
1.099996    2.76E-16        0.0
1.099998   -5.55E-17        1.11E-16
1.100000    5.55E-17        0.0
1.100002    5.55E-17        5.55E-17
1.100004   -5.55E-17        0.0
1.100006   -1.67E-16       -1.67E-16
1.100008   -6.11E-16       -5.00E-16


Figure 2.7 Band of uncertainty in evaluation of a function. Panels: Simple root; Double root.


The numerical results are given in Table 2.10. Note that the arithmetic being
used has about 16 decimal digits in the floating-point representation. Thus,
according to the numerical results in the table, no more than 6 digits of accuracy
can be expected in calculating the root $\alpha = 1.1$ of f(x). Also note the effect of
using the different representations (2.7.3) and (2.7.4).
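
The experiment is easy to repeat in any language with IEEE double precision. The sketch below is in Python rather than the BASIC of the original computation, so the individual digits need not match Table 2.10, but the size of the noise should be comparable. It evaluates both forms near x = 1.1 and prints them beside the factored form for comparison.

```python
def f_expanded(x):
    # expanded form (2.7.3)
    return x**4 - 5.4 * x**3 + 10.56 * x**2 - 8.954 * x + 2.7951

def f_nested(x):
    # nested multiplication (2.7.4)
    return 2.7951 + x * (-8.954 + x * (10.56 + x * (-5.4 + x)))

def f_factored(x):
    # factored form, far less affected by cancellation
    return (x - 1.1) ** 3 * (x - 2.1)

points = [1.1 + 2e-6 * k for k in range(-4, 5)]
for x in points:
    print(f"{x:.6f}  {f_expanded(x): .2e}  {f_nested(x): .2e}  {f_factored(x): .2e}")
```

Within this window the true values of f(x) are no larger than about 5E-16 in magnitude, the same order as the rounding noise, so the computed signs carry little information about which side of the root x lies on.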

There is uncertainty in evaluating any function f(x) due to the use of finite
precision arithmetic with its resultant rounding or chopping error. This was
discussed in Section 3 of Chapter 1, under the name of noise in function
evaluation. For multiple roots, this leads to considerable uncertainty as to the
location of the root. In Figure 2.7, the solid line indicates the graph of y = f(x),
and the dotted lines give the region of uncertainty in the evaluation of f(x),
which is due to rounding errors and finite-digit arithmetic. The interval of
uncertainty in finding the root of f(x) is given by the intersection of the band
about the graph of y = f(x) and the x-axis. It is clearly greater with the double
root than with the simple root, even though the vertical widths of the bands
about y = f(x) are the same.


Newton’s method and multiple roots Another problem with multiple roots is 
that the earlier rootfinding methods will not perform as well when the root being 
sought is multiple. We now investigate this for Newton’s method. 
We consider Newton's method as a fixed-point method, as in (2.5.2), with f(x)
satisfying (2.7.1):

$$ x_{n+1} = g(x_n) \qquad g(x) = x - \frac{f(x)}{f'(x)} $$

Before calculating $g'(\alpha)$, we first simplify g(x) using (2.7.1):

$$ f'(x) = (x - \alpha)^p h'(x) + p (x - \alpha)^{p-1} h(x) $$

$$ g(x) = x - \frac{(x - \alpha) h(x)}{p h(x) + (x - \alpha) h'(x)} $$

Differentiating,

$$ g'(x) = 1 - \frac{h(x)}{p h(x) + (x - \alpha) h'(x)} - (x - \alpha) \frac{d}{dx}\left[ \frac{h(x)}{p h(x) + (x - \alpha) h'(x)} \right] $$

and

$$ g'(\alpha) = 1 - \frac{1}{p} \ne 0 \quad \text{for } p > 1 \tag{2.7.5} $$

Thus Newton's method is a linear method with rate of convergence (p - 1)/p.

Example Find the smallest root of

$$ f(x) = -4.68999 + x(9.1389 + x(-5.56 + x)) \tag{2.7.6} $$

using Newton's method. The numerical results are shown in Table 2.11. The
calculations were done on an IBM PC microcomputer in double precision
arithmetic (in BASIC). Only partial results are shown, to indicate the general
course of the calculation. The column labeled Ratio is the rate of linear
convergence as measured by $\lambda_n$ in (2.6.3).

Table 2.11 Newton's method for (2.7.6)

n    x_n            f(x_n)      α - x_n    Ratio
0    1.22           -1.88E-4    1.00E-2
1    1.2249867374   -4.71E-5    5.01E-3
2    1.2274900222   -1.18E-5    2.51E-3    .502
3    1.2287441705   -2.95E-6    1.26E-3    .501
4    1.2293718746   -7.38E-7    6.28E-4    .501
5    1.2296858846   -1.85E-7    3.14E-4    .500
18   1.2299999621   -2.89E-15   3.80E-8    .505
19   1.2299999823   -6.66E-16   1.77E-8    .525
20   1.2299999924   -1.11E-16   7.58E-9    .496
21   1.2299999963    0.0        3.66E-9    .383
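
The first rows of Table 2.11 can be reproduced as follows. This is a sketch in Python rather than the BASIC used for the table; the derivative `fprime` is obtained here by differentiating (2.7.6) by hand, and the ratios are the $\lambda_n$ of (2.6.3).

```python
def f(x):
    return -4.68999 + x * (9.1389 + x * (-5.56 + x))      # (2.7.6)

def fprime(x):
    return 9.1389 + x * (-11.12 + 3.0 * x)                # f'(x) = 3x^2 - 11.12x + 9.1389

xs = [1.22]
for _ in range(6):
    x = xs[-1]
    xs.append(x - f(x) / fprime(x))                       # Newton step

# lambda_n of (2.6.3), for n = 2, 3, ...
ratios = [(xs[n] - xs[n - 1]) / (xs[n - 1] - xs[n - 2]) for n in range(2, len(xs))]
print(xs[1], ratios)
```

The ratios stay near 1/2, the signature of a double root under (2.7.5).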


The Newton method for solving for the root of (2.7.6) is clearly linear in this
case, with a linear rate of $g'(\alpha) = \frac{1}{2}$. This is consistent with (2.7.5), since
$\alpha = 1.23$ is a root of multiplicity p = 2. The final iterates in the table are being
affected by the noise in the computer evaluation of f(x). Even though the
floating-point representation contains about 16 digits, only about 8 digits of
accuracy can be found in this case.

To improve Newton's method, we would like a function g(x) for which
$g'(\alpha) = 0$. Based on the derivation of (2.7.5), define

$$ g(x) = x - p \frac{f(x)}{f'(x)} $$

Then easily, $g'(\alpha) = 0$; thus,

$$ \alpha - x_{n+1} = g(\alpha) - g(x_n) = -g'(\alpha)(x_n - \alpha) - \frac{1}{2}(x_n - \alpha)^2 g''(\xi_n) $$

with $\xi_n$ between $x_n$ and $\alpha$. Thus

$$ \alpha - x_{n+1} = -\frac{1}{2} (\alpha - x_n)^2 g''(\xi_n) $$

showing that the method

$$ x_{n+1} = x_n - p \frac{f(x_n)}{f'(x_n)} \qquad n = 0, 1, 2, \ldots \tag{2.7.7} $$

has order of convergence two, the same as the original Newton method for simple
roots.
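
A minimal sketch of iteration (2.7.7), applied to the function (2.7.6) with p = 2, is given below. The function name `modified_newton` is hypothetical, and the check for a zero derivative is a safeguard added here. Only two steps are taken, because later iterates are dominated by rounding noise in f(x).

```python
def f(x):
    return -4.68999 + x * (9.1389 + x * (-5.56 + x))      # (2.7.6)

def fprime(x):
    return 9.1389 + x * (-11.12 + 3.0 * x)

def modified_newton(f, fprime, x0, p, n_steps):
    """Iteration (2.7.7): x_{n+1} = x_n - p f(x_n) / f'(x_n)."""
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        d = fprime(x)
        if d == 0.0:           # added safeguard against division by zero
            break
        xs.append(x - p * f(x) / d)
    return xs

xs = modified_newton(f, fprime, 1.22, 2, 2)
print(xs)
```

The multiplicity p must be known or estimated in advance; supplying the wrong p gives back a linearly convergent method.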

Example Apply (2.7.7) to the preceding example (2.7.6), using p = 2 for a 
double root. The results are given in Table 2.12, using the same computer as 
previously. The iterates converge rapidly, and then they oscillate around the root. 
The accuracy (or lack of it) reflects the noise in f(x) and the multiplicity of the 
root. 


Table 2.12 Modified Newton's method (2.7.7), applied to (2.7.6)

n   x_n            f(x_n)      α - x_n
0   1.22           -1.88E-4     1.00E-2
1   1.2299734748   -1.31E-9     2.65E-5
2   1.2299999998   -1.11E-16    1.85E-10
3   1.2300003208   -1.92E-13   -3.21E-7
4   1.2300000001   -1.11E-16   -8.54E-11
5   1.2299993046   -9.04E-13    6.95E-7

Newton's method can be used to determine the multiplicity p, as in Table 2.11
combined with (2.7.5), and then (2.7.7) can be used to speed up the convergence.
But the inherent uncertainty in the root due to the noise and the multiplicity will
remain. This can be removed only by analytically reformulating the rootfinding
problem as a new one in which the desired root $\alpha$ is simple. The easiest way to do
this is to form the (p - 1)st derivative of f(x), and to then solve

$$ f^{(p-1)}(x) = 0 \tag{2.7.8} $$

which will have $\alpha$ as a simple root.


Example The previous example had a root of multiplicity p = 2. Then it is a 
simple root of 


f'(x) = 3x? — 11.12x + 9.1389 


Using the last iterate in Table 2.11 as an initial guess, and applying Newton's
method to finding the root of f'(x) just given, only one iteration was needed to
find the value of $\alpha$ to the full precision of the computer.


2.8 Brent’s Rootfinding Algorithm 


We describe an algorithm that combines the advantages of the bisection method
and the secant method, while avoiding the disadvantages of each of them. The
algorithm is due to Brent (1973, chap. 4), and it is a further development of an
earlier algorithm due to Dekker (1969). The algorithm results in a small interval
that contains the root. If the function is sufficiently smooth around the desired
root $\zeta$, then the order of convergence will be superlinear, as with the secant
method.

In describing the algorithm we use the notation of Brent (1973, p. 47). The
program is entered with two values, $a_0$ and $b_0$, for which (1) there is at least one
root $\zeta$ of f(x) between $a_0$ and $b_0$, and (2) $f(a_0) f(b_0) \le 0$. The program is also
entered with a desired tolerance t, from which a stopping tolerance $\delta$ is
produced:

$$ \delta = t + 2\epsilon |b| \tag{2.8.1} $$

with $\epsilon$ the unit round for the computer [see (1.2.11) of Chapter 1].

In a typical step of the algorithm, b is the best current estimate of the root $\zeta$,
a is the previous value of b, and c is a past iterate that has been so chosen that
the root $\zeta$ lies between b and c (initially c = a). Define $m = \frac{1}{2}(c - b)$.

Stop the algorithm if (1) f(b) = 0, or (2) $|m| \le \delta$. In either case, set the
approximate root $\hat{\zeta} = b$. For case (2), because b will usually have been obtained
by the secant method, the root $\zeta$ will generally be closer to b than to c. Thus
usually,

$$ |\zeta - \hat{\zeta}| \le \delta $$

although all that can be guaranteed is that

$$ |\zeta - \hat{\zeta}| \le 2\delta $$

If the error test is not satisfied, set

$$ i = b - f(b) \frac{b - a}{f(b) - f(a)} \tag{2.8.2} $$

Then set

$$ b'' = \begin{cases} i & \text{if } i \text{ lies between } b \text{ and } b + m = \dfrac{b + c}{2} \\ b + m & \text{otherwise [which is the bisection method]} \end{cases} $$

In the case that a, b, and c are distinct, the secant method in the definition of i
is replaced by an inverse quadratic interpolation method. This results in a very
slightly faster convergence for the overall algorithm. Following the determination
of b'', define

$$ b' = \begin{cases} b'' & \text{if } |b - b''| > \delta \\ b + \delta \cdot \operatorname{sign}(m) & \text{if } |b - b''| \le \delta \end{cases} \tag{2.8.3} $$

If you are some distance from the root, then b' = b''. With this choice, the
method is (1) linear (or quadratic) interpolation, or (2) the bisection method;
usually it is (1) for a smooth function f(x). This generally results in a value of m
that does not become small. To obtain a small interval containing the root $\zeta$,
once we are close to it, we use $b' := b + \delta \cdot \operatorname{sign}(m)$, a step of $\delta$ in the direction
of c. Because of the way in which a new c is chosen, this will usually result in a
new small interval about $\zeta$. Brent makes an additional important, but technical
step before choosing a new b, usually the b' just given.

Having obtained the new b', we set b = b', a = the old value of b. If the sign
of f(b), using the new b, is the same as with the old b, the value of c is
unchanged; otherwise, c is set to the old value of b, resulting in a smaller interval
about $\zeta$. The accuracy of the value of b is now tested, as described earlier.

Brent has taken great care to avoid underflow and overflow difficulties with his 
method, but the program is somewhat complicated to read as a consequence. 
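
The steps just described can be sketched as follows. This is a simplified illustration and not Brent's program: it uses only the secant step (2.8.2), omitting inverse quadratic interpolation and Brent's further safeguards, and it assumes a Dekker-style swap that keeps b as the endpoint with the smaller |f|. The name `brent_like` and its parameters are hypothetical.

```python
import sys

def brent_like(f, a, b, t=1e-5, max_iter=200):
    """Secant steps with a bisection fallback and the delta step of (2.8.3)."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must not have the same sign")
    c, fc = a, fa
    eps = sys.float_info.epsilon / 2          # unit round, double precision
    for _ in range(max_iter):
        if abs(fc) < abs(fb):                 # assumed swap: keep b the better estimate
            a, fa = b, fb
            b, fb = c, fc
            c, fc = a, fa
        delta = t + 2 * eps * abs(b)          # stopping tolerance (2.8.1)
        m = 0.5 * (c - b)
        if fb == 0 or abs(m) <= delta:
            return b
        # secant step (2.8.2); fall back to bisection if it is undefined
        i = b - fb * (b - a) / (fb - fa) if fb != fa else b + m
        lo, hi = min(b, b + m), max(b, b + m)
        b2 = i if lo < i < hi else b + m      # b'' of the text
        if abs(b2 - b) > delta:
            b1 = b2
        else:
            b1 = b + (delta if m > 0 else -delta)   # step of delta toward c, (2.8.3)
        a, fa = b, fb
        b, fb = b1, f(b1)
        if (fb > 0) != (fa > 0):              # sign changed: root lies between old and new b
            c, fc = a, fa
    return b

print(brent_like(lambda x: x * x - 1.0, 0.0, 3.0))
```

Since the root always lies between b and c, the returned value is within $2\delta$ of a root whenever the loop exits through the test $|m| \le \delta$.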


Example Each of the following cases was computed on an IBM PC with 8087
arithmetic coprocessor and single precision arithmetic satisfying the IEEE standard
for floating-point arithmetic. The tolerance was $t = 10^{-5}$, and thus

$$ \delta = 10^{-5} + 2|b| \times (5.96 \times 10^{-8}) \approx 1.01 \times 10^{-5} $$

since the root is $\zeta = 1$ in all cases. The functions were evaluated in the form in
which they are given here, and in all cases, the initial interval was [a, b] = [0, 3].
The table values for b and c are rounded to seven decimal digits.

Table 2.13 Example 1 of Brent's method

b          f(b)        c
0.0        -2.00E+0    3.0
0.5        -6.25E-1    3.0
.7139038   -3.10E-1    3.0
.9154507   -8.52E-2    3.0
.9901779   -9.82E-3    3.0
.9998567   -1.43E-4    3.0
.9999999   -1.19E-7    3.0
.9999999   -1.19E-7    1.000010


Case (1) $f(x) = (x - 1)[1 + (x - 1)^2]$. The numerical results are given in
Table 2.13. This illustrates the necessity of using $b' = b + \delta \cdot \operatorname{sign}(m)$ in order to
obtain a small interval enclosing the root $\zeta$.


Case (2) $f(x) = x^2 - 1$. The numerical results are given in Table 2.14.


Case (3) $f(x) = -1 + x(3 + x(-3 + x))$. The root is $\zeta = 1$, of multiplicity
three, and it took 50 iterations to converge to the approximate root 1.000001.
With the initial values a = 0, b = 3, the bisection method would use only 19
iterations for the same accuracy. If Brent's method is compared with the
bisection method over the class of all continuous functions, then the number of
necessary iterates for an error tolerance of $\delta$ is approximately

$$ \log_2\left(\frac{b - a}{\delta}\right) \quad \text{for bisection method} $$

$$ \left[\log_2\left(\frac{b - a}{\delta}\right)\right]^2 \quad \text{for Brent's algorithm} $$

Table 2.14 Example 2 of Brent's method

b          f(b)        c
0.0        -1.00       3.0
.3333333   -8.89E-1    3.0
.3333333   -8.89E-1    1.666667
.7777778   -4.00E-1    1.666667
1.068687    1.42E-1    .7777778
.9917336   -1.65E-2    1.068687
.9997244   -5.51E-4    1.068687
1.000000    2.38E-7    .9997244
1.000000    2.38E-7    .9999900


Table 2.15 Example 4 of Brent's method

b          f(b)        c
0.0        -3.68E-1    3.0
.5731754   -1.76E-3    3.0
.5759331   -1.63E-3    3.0
.6098443   -5.47E-4    3.0
.6136354   -4.76E-4    1.804922
.6389258   -1.68E-4    1.804922
1.221924    3.37E-10   .6389258
1.221914    3.37E-10   .6389258
1.216585    1.20E-10   .6389258
.9277553    0.0        1.216585


Thus there are cases for which bisection is better, as our example shows. But for
sufficiently smooth functions with $f'(\alpha) \ne 0$, Brent's algorithm is almost always
far faster.


Case (4) $f(x) = (x - 1)\exp[-1/(x - 1)^2]$. The root x = 1 has infinite multiplicity,
since $f^{(r)}(1) = 0$ for all $r \ge 0$. The numerical results are given in Table
2.15. Note that the routine has found an exact root for the machine version of
f(x), due to the inherent imprecision in the evaluation of the function; see the
preceding section on multiple roots. This root is of course very inaccurate, but
this is nothing that the program can treat.


Brent’s original program, published in 1973, continues to be very popular and 
well-used. Nonetheless, improvements and extensions of it continue to be made. 
For one of them, and for a review of others, see Le (1985). 


2.9 Roots of Polynomials 
We will now consider solving the polynomial equation

$$ p(x) \equiv a_0 + a_1 x + \cdots + a_n x^n = 0 \qquad a_n \ne 0 \tag{2.9.1} $$


This problem arises in many ways, and a large literature has been created to deal 
with it. Sometimes a particular root is wanted and a good initial guess is known. 
In that case, the best approach is to modify one of the earlier iterative methods to 
take advantage of the special form of polynomials. In other cases, little may be 
known about the location of the roots, and then other methods must be used, of 
which there are many. In this section, we just give a brief excursion into the area 
of polynomial rootfinding, without any pretense of completeness. Modifications 
of the methods of earlier sections will be emphasized, and numerical stability 
questions will be considered. We begin with a review of some results on bounding 
or roughly locating the roots of (2.9.1).

Location theorems Because p(x) is a polynomial, many results can be given
about the roots of p(x), results that are not true for other functions. The best
known of these is the fundamental theorem of algebra, which allows us to write
p(x) as a unique product (except for order) involving the roots

$$ p(x) = a_n (x - z_1) \cdots (x - z_n) \tag{2.9.2} $$

and $z_1, \ldots, z_n$ are the roots of p(x), repeated according to their multiplicity. We
now give some classical results on locating and bounding these roots.

Descartes’s rule of signs is used to bound the number of positive real roots of 
p(x), assuming the coefficients $a_0, \ldots, a_n$ are all real. 

Let $\nu$ be the number of changes of sign in the coefficients of p(x) in (2.9.1), 
ignoring the zero terms. Let k denote the number of positive real roots of 
p(x), counted according to their multiplicity. Then $k \le \nu$ and $\nu - k$ is 
even. 


A proof of this is given in Henrici (1974, p. 442) and Householder (1970, p. 82). 


Example The expression $p(x) = x^6 - x - 1$ has $\nu = 1$ change of sign. Therefore, 
k = 1; otherwise, k = 0, and $\nu - k = 1$ is not an even integer, a contradiction. 


Descartes’s rule of signs can also be used to bound the number of negative 
roots of p(x). Apply it to the polynomial 

$$ q(x) = p(-x) $$

Its positive roots are the negative roots of p(x). Applying this to the last 
example, $q(x) = x^6 + x - 1$. Again there is one positive real root [of q(x)], and 
thus one negative real root of p(x). 

An upper bound for all of the roots of p(x) is given by the following: 

$$ |z_j| \le R = 1 + \max_{0 \le i \le n-1} \left| \frac{a_i}{a_n} \right| \tag{2.9.3} $$

This is due to Augustin Cauchy, in 1829, and a proof is given in Householder 
(1970, p. 71). Another such result of Cauchy is based on considering the 
polynomials 

$$ |a_n| x^n + |a_{n-1}| x^{n-1} + \cdots + |a_1| x - |a_0| = 0 \tag{2.9.4} $$

$$ |a_n| x^n - |a_{n-1}| x^{n-1} - \cdots - |a_1| x - |a_0| = 0 \tag{2.9.5} $$


Assume that $a_0 \neq 0$, which is equivalent to assuming x = 0 is not a root of p(x). 
Then by Descartes’s rule of signs, each of these polynomials has exactly one 
positive root; call them $\rho_1$ and $\rho_2$, respectively. Then all roots $z_j$ of p(x) satisfy 

$$ \rho_1 \le |z_j| \le \rho_2 \tag{2.9.6} $$

The proof of the upper bound is given in Henrici (1974, p. 458) and Householder 
(1970, p. 70). The proof of the lower bound can be based on the following 
approach, which can also be used in constructing a lower bound for (2.9.3). 
Consider the polynomial 

$$ q(x) = x^n p\!\left(\frac{1}{x}\right) = a_n + a_{n-1} x + \cdots + a_1 x^{n-1} + a_0 x^n, \qquad a_0 \neq 0 \tag{2.9.7} $$

Then the roots of q(x) are 1/z, where z is a root of p(x). If the upper bound 
result of (2.9.6) is applied to (2.9.7), the lower bound result of (2.9.6) is obtained. 
We leave this application to be shown in a problem. 

Because each of the polynomials (2.9.4), (2.9.5) has a single simple positive 
root, Newton’s method can be easily used to construct $\rho_1$ and $\rho_2$. As an initial 
guess, use the upper bound from (2.9.3) or experiment with smaller positive 
initial guesses. We leave the illustration of these results to the problems. 
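
As a rough illustration of the bounds (2.9.3) and (2.9.6), the following Python sketch computes R and then applies Newton's method to the auxiliary polynomials (2.9.4) and (2.9.5), starting from R. The function name `cauchy_bounds`, the coefficient ordering (`a[i]` multiplies x**i), and the stopping tolerance are assumptions made for this example; they are not taken from the text.

```python
def cauchy_bounds(a, tol=1e-12, itmax=200):
    """Return (R, rho1, rho2) for p(x) = sum a[i] * x**i with a[0] != 0, a[-1] != 0."""
    n = len(a) - 1
    abs_a = [abs(c) for c in a]
    R = 1.0 + max(abs_a[i] / abs_a[n] for i in range(n))      # bound (2.9.3)

    def newton(coef, x):
        # Newton's method; value and derivative via nested multiplication
        for _ in range(itmax):
            p, dp = 0.0, 0.0
            for c in reversed(coef):
                dp = dp * x + p
                p = p * x + c
            x_new = x - p / dp
            if abs(x_new - x) <= tol * abs(x_new):
                return x_new
            x = x_new
        return x

    lower = abs_a[:]                  # polynomial (2.9.4): only |a_0| is negated
    lower[0] = -lower[0]
    upper = [-c for c in abs_a]       # polynomial (2.9.5): all but |a_n| negated
    upper[n] = abs_a[n]
    return R, newton(lower, R), newton(upper, R)


# p(x) = x^6 - x - 1, the polynomial of the earlier example
R, rho1, rho2 = cauchy_bounds([-1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 1.0])
print(R, rho1, rho2)
```

Starting Newton's method at R is one of the choices suggested above; since R is itself an upper bound for the root moduli, the iterates approach each positive root from the right.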

There are many other results of the preceding type, and both Henrici (1974, 
chap. 6) and Householder (1970) wrote excellent treatises on the subject. 


Nested multiplication A very efficient way to evaluate the polynomial p(x) 
given in (2.9.1) is to use nested multiplication: 

$$ p(x) = a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1} + a_n x) \cdots )) \tag{2.9.8} $$

With formula (2.9.1), there are n additions and 2n - 1 multiplications, and with 
(2.9.8) there are n additions and n multiplications, a considerable saving. 

For later work, it is convenient to introduce the following auxiliary coefficients. 
Let $b_n = a_n$, 

$$ b_k = a_k + z b_{k+1}, \qquad k = n-1, n-2, \ldots, 0 \tag{2.9.9} $$

By considering (2.9.8), it is easy to see that 

$$ p(z) = b_0 \tag{2.9.10} $$

Introduce the polynomial 

$$ q(x) = b_1 + b_2 x + \cdots + b_n x^{n-1} \tag{2.9.11} $$

Then 

$$ \begin{aligned} b_0 + (x - z) q(x) &= b_0 + (x - z)\left[ b_1 + b_2 x + \cdots + b_n x^{n-1} \right] \\ &= (b_0 - b_1 z) + (b_1 - b_2 z) x + \cdots + (b_{n-1} - b_n z) x^{n-1} + b_n x^n \\ &= a_0 + a_1 x + \cdots + a_n x^n = p(x) \end{aligned} $$

$$ p(x) = b_0 + (x - z) q(x) \tag{2.9.12} $$

where q(x) is the quotient and $b_0$ the remainder when p(x) is divided by x - z. 
The use of (2.9.9) to evaluate p(z) and to form the quotient polynomial q(x) is 
also called Horner’s method. 

If z is a root of p(x), then $b_0 = 0$ and p(x) = (x - z)q(x). To find 
additional roots of p(x), we can restrict our search to the roots of q(x). This 
reduction process is called deflation; it must be used with some caution, a point 
we will return to later. 
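
The recurrence (2.9.9) translates almost directly into code. The sketch below is a minimal Python version; the function name `horner_deflate` and the convention that `a[k]` is the coefficient of x**k are assumptions for illustration. It returns both the value $b_0 = p(z)$ and the coefficients of the quotient q(x) from (2.9.11).

```python
def horner_deflate(a, z):
    """Evaluate p(z) and form q(x), where p(x) = b0 + (x - z) q(x) as in (2.9.12).

    a[k] is the coefficient of x**k. Returns (b0, q), with q[k] the
    coefficient of x**k in q(x) = b1 + b2 x + ... + bn x**(n-1).
    """
    n = len(a) - 1
    b = [0.0] * (n + 1)
    b[n] = a[n]
    for k in range(n - 1, -1, -1):
        b[k] = a[k] + z * b[k + 1]          # recurrence (2.9.9)
    return b[0], b[1:]


# p(x) = (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6
value, quotient = horner_deflate([-6.0, 11.0, -6.0, 1.0], 1.0)
print(value, quotient)      # remainder p(1) and the deflated quadratic
```

When z is a root, the remainder is zero and `quotient` holds the deflated polynomial, here $x^2 - 5x + 6$.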


Newton’s method If we want to apply Newton’s method to find a root of p(x), 
we must be able to evaluate both p(x) and p’(x) at any point z. From (2.9.12), 

$$ p'(x) = (x - z) q'(x) + q(x) $$

$$ p'(z) = q(z) \tag{2.9.13} $$

We use (2.9.10) and (2.9.13) in the following adaptation of Newton’s method to 
polynomial rootfinding. 


Algorithm Polynew (a, n, $x_0$, $\epsilon$, itmax, root, b, ier) 

1. Remark: a is a vector of coefficients, itmax the maximum 
number of iterates to be computed, b the vector of coefficients 
for the deflated polynomial, and ier an error indicator. 

2. itnum := 1 

3. $z := x_0$, $b_n := a_n$, $c := a_n$ 

4. For k = n-1, ..., 1, $b_k := a_k + z b_{k+1}$, $c := b_k + z c$ 

5. $b_0 := a_0 + z b_1$ 

6. If c = 0, ier := 2 and exit. 

7. $x_1 := x_0 - b_0 / c$ 

8. If $|x_1 - x_0| \le \epsilon$, then ier := 0, root := $x_1$, and exit. 

9. If itnum = itmax, then ier := 1 and exit. 

10. Otherwise, itnum := itnum + 1, $x_0 := x_1$, and go to step 3. 
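
The algorithm above can be sketched in Python as follows. This is only an illustrative transcription: the function signature, the default tolerance, and the choice to return a tuple `(root, b, ier)` rather than use output arguments are assumptions. As in the algorithm, the returned deflated coefficients come from the last Horner pass, that is, from the iterate preceding the accepted root.

```python
def polynew(a, x0, eps=1e-12, itmax=50):
    """Newton's method for p(x) = sum a[k] * x**k (degree n >= 1).

    Returns (root, b, ier): b holds b_1, ..., b_n from the last Horner pass,
    ier = 0 on success, 1 if itmax was reached, 2 if p'(x0) = 0 was met.
    """
    n = len(a) - 1
    b = [0.0] * (n + 1)
    for _ in range(itmax):
        z = x0
        b[n] = c = a[n]                   # step 3
        for k in range(n - 1, 0, -1):     # step 4
            b[k] = a[k] + z * b[k + 1]    # b_k from (2.9.9)
            c = b[k] + z * c              # c accumulates q(z) = p'(z)
        b[0] = a[0] + z * b[1]            # step 5: b_0 = p(z)
        if c == 0:
            return x0, b[1:], 2
        x1 = x0 - b[0] / c                # Newton step
        if abs(x1 - x0) <= eps:
            return x1, b[1:], 0
        x0 = x1
    return x0, b[1:], 1


root, deflated, ier = polynew([-1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 1.0], 1.5)
print(root, ier)
```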

Stability problems There are many polynomials in which the roots are quite 
sensitive to small changes in the coefficients. Some of these are problems with 
multiple roots, and it is not surprising that these roots are quite sensitive to small 
changes in the coefficients. But there are many polynomials with only simple 
roots that appear to be well separated, and for which the roots are still quite 
sensitive to small perturbations. Formulas are derived below that explain this 
sensitivity, and numerical examples are also given. 

For the theory, introduce 

$$ p(x) = a_0 + a_1 x + \cdots + a_n x^n, \qquad a_n \neq 0 $$

$$ q(x) = b_0 + b_1 x + \cdots + b_n x^n \tag{2.9.14} $$

and define a perturbation of p(x) by 

$$ p(x; \epsilon) = p(x) + \epsilon q(x) \tag{2.9.15} $$

Denote the zeros of $p(x; \epsilon)$ by $z_1(\epsilon), \ldots, z_n(\epsilon)$, repeated according to their 
multiplicity, and let $z_j = z_j(0)$, j = 1, ..., n, denote the corresponding n zeros of 
p(x) = p(x; 0). It is well known that the zeros of a polynomial are continuous 
functions of the coefficients of the polynomial [see, for example, Henrici (1974, p. 
281)]. Consequently, $z_j(\epsilon)$ is a continuous function of $\epsilon$. What we want to 
determine is how rapidly the root $z_j(\epsilon)$ varies with $\epsilon$, for $\epsilon$ near 0. 
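
Example Define 

$$ p(x; \epsilon) = (x - 1)^3 - \epsilon, \qquad p(x) = (x - 1)^3, \qquad \epsilon > 0 $$

Then the roots of p(x) are $z_1 = z_2 = z_3 = 1$. The roots of $p(x; \epsilon)$ are 

$$ z_1(\epsilon) = 1 + \sqrt[3]{\epsilon}, \qquad z_2(\epsilon) = 1 + \omega \sqrt[3]{\epsilon}, \qquad z_3(\epsilon) = 1 + \omega^2 \sqrt[3]{\epsilon} $$

with $\omega = \frac{1}{2}(-1 + i\sqrt{3})$. For all three roots of $p(x; \epsilon)$, 

$$ |z_j(\epsilon) - 1| = \sqrt[3]{\epsilon} $$

To illustrate this, let $\epsilon = .001$. Then 

$$ p(x; \epsilon) = x^3 - 3x^2 + 3x - 1.001 $$

which is a relatively small change in p(x). But for the roots, 

$$ |z_j(\epsilon) - 1| = .1 $$

a relatively large change in the roots $z_j = 1$. 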
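
The cube-root growth in this example is easy to observe numerically. The following sketch uses NumPy's general-purpose `numpy.roots` routine, which is an assumption of this illustration and not the method of the text, to compute the roots of $(x - 1)^3 - \epsilon$ and compare their distance from 1 with $\sqrt[3]{\epsilon}$.

```python
import numpy as np


def root_shifts(eps):
    """Return |z_j(eps) - 1| for the three roots of (x - 1)**3 - eps."""
    # Coefficients in descending powers: x^3 - 3x^2 + 3x - (1 + eps)
    roots = np.roots([1.0, -3.0, 3.0, -1.0 - eps])
    return np.abs(roots - 1.0)


for eps in (1e-3, 1e-6):
    print(eps, root_shifts(eps), eps ** (1.0 / 3.0))
```

A perturbation of size 1e-6 in the constant term moves every root by about 1e-2, four orders of magnitude more than the change in the coefficient.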

We now give some more general estimates for $z_j(\epsilon) - z_j$. 

Case (1) $z_j$ is a simple root of p(x), and thus $p'(z_j) \neq 0$. Using the theory of 
functions of a complex variable, it is known that $z_j(\epsilon)$ can be written as a power 
series: 

$$ z_j(\epsilon) = z_j + \sum_{l=1}^{\infty} \gamma_l \epsilon^l \tag{2.9.16} $$

To estimate $z_j(\epsilon) - z_j$, we obtain a formula for the first term $\gamma_1 \epsilon$ in the series. To 
begin, it is easy to see that 

$$ \gamma_1 = z_j'(0) $$

To calculate $z_j'(\epsilon)$, differentiate the identity 

$$ p(z_j(\epsilon)) + \epsilon q(z_j(\epsilon)) = 0 $$

which holds for all sufficiently small $\epsilon$. We obtain 

$$ p'(z_j(\epsilon)) z_j'(\epsilon) + q(z_j(\epsilon)) + \epsilon q'(z_j(\epsilon)) z_j'(\epsilon) = 0 $$

$$ z_j'(\epsilon) = \frac{-q(z_j(\epsilon))}{p'(z_j(\epsilon)) + \epsilon q'(z_j(\epsilon))} \tag{2.9.17} $$

Substituting $\epsilon = 0$, we obtain 

$$ \gamma_1 = z_j'(0) = -\frac{q(z_j)}{p'(z_j)} $$

Returning to (2.9.16), 

$$ z_j(\epsilon) = z_j - \frac{q(z_j)}{p'(z_j)} \epsilon + \sum_{l=2}^{\infty} \gamma_l \epsilon^l $$

$$ \left| z_j(\epsilon) - \left[ z_j - \frac{q(z_j)}{p'(z_j)} \epsilon \right] \right| \le K |\epsilon|^2, \qquad |\epsilon| \le \epsilon_0 \tag{2.9.18} $$

for some constants $\epsilon_0 > 0$ and K > 0. To estimate $z_j(\epsilon)$ for small $\epsilon$, we use 

$$ z_j(\epsilon) \doteq z_j - \frac{q(z_j)}{p'(z_j)} \epsilon \tag{2.9.19} $$

The coefficient of $\epsilon$ determines how rapidly $z_j(\epsilon)$ changes relative to $\epsilon$; if it is 
large, the root $z_j$ is called ill-conditioned. 


Case (2) $z_j$ has multiplicity m > 1. By using techniques related to those used in 
case (1), we can obtain 

$$ \left| z_j(\epsilon) - \left[ z_j + \gamma_1 \epsilon^{1/m} \right] \right| \le K |\epsilon|^{2/m}, \qquad |\epsilon| \le \epsilon_0 \tag{2.9.20} $$

for some $\epsilon_0 > 0$, K > 0. There are m possible values of $\gamma_1$, given as the m 
complex roots of 

$$ \gamma_1^m = -\frac{m!\, q(z_j)}{p^{(m)}(z_j)} $$

Example Consider the simple polynomial 

$$ \begin{aligned} p(x) &= (x - 1)(x - 2) \cdots (x - 7) \\ &= x^7 - 28x^6 + 322x^5 - 1960x^4 + 6769x^3 - 13132x^2 + 13068x - 5040 \end{aligned} \tag{2.9.21} $$

For the perturbation, take 

$$ q(x) = x^6, \qquad \epsilon = -.002 $$

Then for the root $z_j = j$, 

$$ p'(z_j) = \prod_{l \neq j} (j - l), \qquad q(z_j) = j^6 $$

From (2.9.19), we have the estimate 

$$ z_j(\epsilon) \doteq j + \frac{(.002)\, j^6 (-1)^{j-1}}{(j-1)!\,(7-j)!} \equiv j + \delta(j) \tag{2.9.22} $$

The numerical values of $\delta(j)$ are given in Table 2.16. The relative error in the 
coefficient of $x^6$ is .002/28 = 7.1E-5, but the relative errors in the roots are 
much larger. In fact, the size of some of the perturbations $\delta(j)$ casts doubt on 
the validity of using the linear estimate (2.9.22). The actual roots of $p(x) + \epsilon q(x)$ 
are given in Table 2.17, and they correspond closely to the predicted perturbations. 
The major departure is in the roots for j = 5 and j = 6. They are complex, 
which was not predicted by the linear estimate (2.9.22). In these two cases, $\epsilon$ is 
outside the radius of convergence of the power series (2.9.16), since the latter will 
only have the real coefficients 

$$ \gamma_l = \frac{1}{l!} z_j^{(l)}(0) $$

obtained from differentiating (2.9.17). 


Table 2.16 Values of $\delta(j)$ from (2.9.22) 

j    $\delta(j)$ 
1     2.78E-6 
2    -1.07E-3 
3     3.04E-2 
4    -2.28E-1 
5     6.51E-1 
6    -7.77E-1 
7     3.27E-1 

Table 2.17 Roots of $p(x; \epsilon)$ for (2.9.21) 

j    $z_j(\epsilon)$                   $z_j(\epsilon) - z_j(0)$ 
1    1.0000028                    2.80E-6 
2    1.9989382                   -1.06E-3 
3    3.0331253                    3.31E-2 
4    3.8195692                   -1.80E-1 
5    5.4586758 + .54012578i 
6    5.4586758 - .54012578i 
7    7.2330128                    2.33E-1 
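
The entries of Tables 2.16 and 2.17 can be reproduced approximately with a few lines of Python. In this sketch the helper name `delta` is hypothetical, and `numpy.poly` and `numpy.roots` are used as convenient general-purpose tools; they are assumptions of the illustration rather than the computation used for the tables.

```python
import math
import numpy as np


def delta(j, eps=-0.002):
    """Linear estimate of the shift in z_j = j: -eps * q(j) / p'(j), from (2.9.19)."""
    dp = math.prod(j - l for l in range(1, 8) if l != j)   # p'(j) for (2.9.21)
    return -eps * j**6 / dp


# Coefficients of (x - 1)...(x - 7) in descending powers, then perturb x^6
coeffs = np.poly(np.arange(1, 8)).astype(float)
coeffs[1] += -0.002                      # p(x) + eps * x^6 with eps = -.002
perturbed = np.sort_complex(np.roots(coeffs))

for j in range(1, 8):
    print(j, delta(j), perturbed[j - 1])
```

The predicted shifts agree well with the computed roots for j = 1, 2, 3, while the roots near 5 and 6 merge into a complex conjugate pair, which no real first-order estimate can reproduce.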


We say that a polynomial whose roots are unstable with respect to small 
relative changes in the coefficients is ill-conditioned. Many such polynomials 
occur naturally in applications. The previous example should illustrate the 
difficulty in determining with only a cursory examination whether or not a 
polynomial is ill-conditioned. 


Polynomial deflation Another problem occurs with deflation of a polynomial to 
a lower degree polynomial, a process defined following (2.9.12). Since the zeros 
will not be found exactly, the lower degree polynomial (2.9.11) found by 
extracting the latest root will generally be in error in all of its coefficients. Clearly 
from the past example, this can cause a significant perturbation in the roots for 
some classes of polynomials. Wilkinson (1963) has analyzed the effects of 
deflation and has recommended the following general strategy: (1) Solve for the 
roots of smallest magnitude first, ending with those of largest size; (2) after 
obtaining approximations to all roots, iterate again using the original polynomial 
and using the previously calculated values as initial guesses. A complete discus- 
sion can be found in Wilkinson (1963, pp. 55-65). 


Example Consider finding the roots of the degree 6 Laguerre polynomial 

$$ p(x) = 720 - 4320x + 5400x^2 - 2400x^3 + 450x^4 - 36x^5 + x^6 $$

The Newton algorithm of the last section was used to solve for the roots, with 
deflation following the acceptance of each new root. The roots were calculated in 
two ways: (1) from largest to smallest, and (2) from smallest to largest. The 
calculations were in single precision arithmetic on an IBM 360, and the numeri- 
cal results are given in Table 2.18. A comparison of the columns headed Method 
(1) and Method (2) shows clearly the superiority of calculating the roots in the 
order of increasing magnitude. If the results of method (1) are used as initial 
guesses for further iteration with the original polynomial, then approximate roots 
are obtained with an accuracy better than that of method (2); see the column 
headed Method (3) in the table. This table shows the importance of iterating 
again with the original polynomial to remove the effects of the deflation process. 




Table 2.18 Example involving polynomial deflation 

True         Method (1)    Method (2)    Method (3) 
15.98287     15.98287      15.98279      15.98287 
9.837467     9.837471      9.837469      9.837467 
5.775144     5.775764      5.775207      5.775144 
2.992736     2.991080      2.992710      2.992736 
1.188932     1.190937      1.188932      1.188932 
.2228466     .2219429      .2228466      .2228466 
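
The strategy recommended above (smallest roots first, then a final pass with the original polynomial) can be sketched as follows. This is an illustrative Python version in double precision, so it will not reproduce the single precision errors of Table 2.18; the function names, the starting guess 0 for each Newton iteration, and the tolerances are assumptions of the example.

```python
def newton_poly(a, x, tol=1e-13, itmax=100):
    """Newton's method for p(x) = sum a[k] * x**k, starting from x."""
    for _ in range(itmax):
        p, dp = 0.0, 0.0
        for c in reversed(a):            # nested multiplication for p and p'
            dp = dp * x + p
            p = p * x + c
        if dp == 0.0:
            break
        x_new = x - p / dp
        if abs(x_new - x) <= tol * max(1.0, abs(x_new)):
            return x_new
        x = x_new
    return x


def deflate(a, z):
    """Coefficients of q(x) in p(x) = b0 + (x - z) q(x), via (2.9.9)."""
    n = len(a) - 1
    b = [0.0] * (n + 1)
    b[n] = a[n]
    for k in range(n - 1, -1, -1):
        b[k] = a[k] + z * b[k + 1]
    return b[1:]


def roots_smallest_first(a):
    """Find roots in increasing order with deflation, then polish each one."""
    work, found = list(a), []
    while len(work) > 1:
        z = newton_poly(work, 0.0)       # 0 lies to the left of all roots here
        found.append(z)
        work = deflate(work, z)
    # Iterate again with the original polynomial to remove deflation errors
    return [newton_poly(list(a), z) for z in found]


laguerre6 = [720.0, -4320.0, 5400.0, -2400.0, 450.0, -36.0, 1.0]
print(roots_smallest_first(laguerre6))
```

Starting each iteration at 0 works for this example because all roots are real and positive; for a general polynomial a different choice of initial guess would be needed.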


There are other ways to deflate a polynomial, one of which favors finding roots 
of largest magnitude first. For a complete discussion see Peters and Wilkinson 
(1971, sec. 5). An algorithm is given for composite deflation, which removes the 
need to find the roots in any particular order. In that paper, the authors also 
discuss the use of implicit deflation, 

$$ q(x) = \frac{p(x)}{(x - z_1) \cdots (x - z_r)} $$

to remove the roots $z_1, \ldots, z_r$ that have been computed previously. This was 
given earlier, in (2.4.4), where it was used in connection with Muller’s method. 


General polynomial rootfinding methods There are a large number of rootfind- 
ing algorithms designed especially for polynomials. Many of these are taken up in 
detail in the books Dejon and Henrici (1969), Henrici (1974, chap. 6), and 
Householder (1970). There are far too many types of such methods to attempt to 
describe them all here. 

One large class of important methods uses location theorems related to those 
described in (2.9.3)—(2.9.6), to iteratively separate the roots into disjoint and ever 
smaller regions, often circles. The best known of such methods is probably the 
Lehmer-Schur method [see Householder (1970, sec. 2.7)]. Such methods converge 
linearly, and for that reason, they are often combined with some more rapidly 
convergent method, such as Newton’s method. Once the roots have been sep- 
arated into distinct regions, the faster method is applied to rapidly obtain the 
root within that region. For a general discussion of such rootfinding methods, see 
Henrici (1974, sec. 6.10). 

Other methods that have been developed into widely used algorithms are the 
method of Jenkins and Traub and the method of Laguerre. For the former, see 
Householder (1970, p. 173) and Jenkins and Traub (1970), (1972). For Laguerre’s 
method, see Householder (1970, sec. 4.5) and Kahan (1967). 

Another easy-to-use numerical method is based on being able to calculate the 
eigenvalues of a matrix. Given the polynomial p(x), it is possible to easily 
construct a matrix with p(x) as its characteristic polynomial (see Problem 2 of 
Chapter 9). Since excellent software exists for solving the eigenvalue problem, 
this software can be used to find the roots of a polynomial p(x). 
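
A minimal sketch of the eigenvalue approach is given below. It assumes one standard form of the companion matrix (ones on the subdiagonal, last column built from the coefficients of the monic polynomial), which may differ in layout from the matrix of the cited problem, and it uses NumPy's `numpy.linalg.eigvals` as the eigenvalue solver.

```python
import numpy as np


def roots_via_companion(a):
    """Roots of p(x) = sum a[k] * x**k, with a[-1] != 0, as matrix eigenvalues."""
    a = np.asarray(a, dtype=float)
    n = len(a) - 1
    C = np.zeros((n, n))
    C[1:, :-1] = np.eye(n - 1)       # ones on the subdiagonal
    C[:, -1] = -a[:-1] / a[-1]       # last column: -a_k / a_n, k = 0, ..., n-1
    return np.linalg.eigvals(C)      # characteristic polynomial is p(x) / a_n


# p(x) = (x - 1)(x - 2)(x - 3)
print(np.sort(roots_via_companion([-6.0, 11.0, -6.0, 1.0]).real))
```

The conditioning issues discussed earlier in this section still apply: an ill-conditioned polynomial remains ill-conditioned when its roots are computed as eigenvalues.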

2.10 Systems of Nonlinear Equations 


This section and the next are concerned with the numerical solution of systems of 
nonlinear equations in several variables. These problems are widespread in 
applications, and they are varied in form. There is a great variety of methods for 
the solution of such systems, so we only introduce the subject. We give some 
general theory and some numerical methods that are easily programmed. To do a 
complete development of the numerical analysis of solving nonlinear systems, we 
would need a number of results from numerical linear algebra, which is not taken 
up until Chapters 7-9. 

For simplicity of presentation and ease of understanding, the theory is 
presented for only two equations: 


$$ f_1(x_1, x_2) = 0 \qquad f_2(x_1, x_2) = 0 \tag{2.10.1} $$

The generalization to n equations in n variables should be straightforward once 
the principal ideas have been grasped. As an additional aid, we will simultaneously 
consider the solution of (2.10.1) in vector notation: 

$$ f(x) = 0, \qquad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad f(x) = \begin{bmatrix} f_1(x_1, x_2) \\ f_2(x_1, x_2) \end{bmatrix} \tag{2.10.2} $$

The solution of (2.10.1) can be looked upon as a two-step process: (1) Find the 
zero curves in the $x_1 x_2$-plane of the surfaces $z = f_1(x_1, x_2)$ and $z = f_2(x_1, x_2)$, 
and (2) find the points of intersection of these zero curves in the $x_1 x_2$-plane. This 
perspective is used in the next section to generalize Newton’s method to solve 
(2.10.1). 


Fixed-point theory We begin by generalizing some of the fixed-point iteration 
theory of Section 2.5. Assume that the rootfinding problem (2.10.1) has been 
reformulated in an equivalent form as 

$$ x_1 = g_1(x_1, x_2) \qquad x_2 = g_2(x_1, x_2) \tag{2.10.3} $$

Denote its solution by $\alpha = (\alpha_1, \alpha_2)$. We study the fixed-point iteration 

$$ x_{1,n+1} = g_1(x_{1,n}, x_{2,n}) \qquad x_{2,n+1} = g_2(x_{1,n}, x_{2,n}) \tag{2.10.4} $$

Using vector notation, we write this as 

$$ x_{n+1} = g(x_n) \tag{2.10.5} $$

with 

$$ x_n = \begin{bmatrix} x_{1,n} \\ x_{2,n} \end{bmatrix}, \qquad g(x) = \begin{bmatrix} g_1(x_1, x_2) \\ g_2(x_1, x_2) \end{bmatrix} $$

Example Consider solving 

$$ f_1 \equiv 3x_1^2 + 4x_2^2 - 1 = 0 \qquad f_2 \equiv x_2^3 - 8x_1^3 - 1 = 0 \tag{2.10.6} $$

for the solution $\alpha$ near $(x_1, x_2) = (-.5, .25)$. We solve this system iteratively with 

$$ \begin{bmatrix} x_{1,n+1} \\ x_{2,n+1} \end{bmatrix} = \begin{bmatrix} x_{1,n} \\ x_{2,n} \end{bmatrix} - \begin{bmatrix} .016 & -.17 \\ .52 & -.26 \end{bmatrix} \begin{bmatrix} 3x_{1,n}^2 + 4x_{2,n}^2 - 1 \\ x_{2,n}^3 - 8x_{1,n}^3 - 1 \end{bmatrix} \tag{2.10.7} $$

The origin of this reformulation of (2.10.6) is given later. The numerical results of 
(2.10.7) are given in Table 2.19. Clearly the iterates are converging rapidly. 

Table 2.19 Example (2.10.7) of fixed-point iteration 

n    $x_{1,n}$         $x_{2,n}$         $f_1(x_{1,n}, x_{2,n})$    $f_2(x_{1,n}, x_{2,n})$ 
0    -.5             .25            0.0         1.56E-2 
1    -.497343750     .254062500     2.43E-4     5.46E-4 
2    -.497254794     .254077922     9.35E-6     2.12E-5 
3    -.497251343     .254078566     3.64E-7     8.26E-7 
4    -.497251208     .254078592     1.50E-8     3.30E-8 
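
The iteration (2.10.7) is short enough to try directly. The following Python sketch uses NumPy; the names `M`, `f`, and `fixed_point` are introduced only for this illustration. It prints the iterates together with the residuals $f_1$ and $f_2$, in the layout of Table 2.19.

```python
import numpy as np

# Constant matrix from (2.10.7)
M = np.array([[0.016, -0.17],
              [0.52, -0.26]])


def f(x):
    """The system (2.10.6)."""
    x1, x2 = x
    return np.array([3 * x1**2 + 4 * x2**2 - 1,
                     x2**3 - 8 * x1**3 - 1])


def fixed_point(x0, steps):
    """Return the iterates x_0, ..., x_steps of (2.10.7)."""
    x = np.array(x0, dtype=float)
    history = [x.copy()]
    for _ in range(steps):
        x = x - M @ f(x)
        history.append(x.copy())
    return history


for n, x in enumerate(fixed_point([-0.5, 0.25], 4)):
    print(n, x, f(x))
```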


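To analyze the convergence of (2.10.5), begin by subtracting the two equations 
in (2.10.4) from the corresponding equations 

$$ \alpha_1 = g_1(\alpha_1, \alpha_2) \qquad \alpha_2 = g_2(\alpha_1, \alpha_2) $$

involving the true solution $\alpha$. Apply the mean value theorem for functions of two 
variables (Theorem 1.5 with n = 1) to these differences to obtain 

$$ \alpha_i - x_{i,n+1} = \frac{\partial g_i(\xi_n^{(i)})}{\partial x_1} (\alpha_1 - x_{1,n}) + \frac{\partial g_i(\xi_n^{(i)})}{\partial x_2} (\alpha_2 - x_{2,n}) $$

for i = 1, 2. The points $\xi_n^{(i)} = (\xi_{1,n}^{(i)}, \xi_{2,n}^{(i)})$ are on the line segment joining $\alpha$ and 
$x_n$. In matrix form, these error equations become 

$$ \begin{bmatrix} \alpha_1 - x_{1,n+1} \\ \alpha_2 - x_{2,n+1} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial g_1(\xi_n^{(1)})}{\partial x_1} & \dfrac{\partial g_1(\xi_n^{(1)})}{\partial x_2} \\ \dfrac{\partial g_2(\xi_n^{(2)})}{\partial x_1} & \dfrac{\partial g_2(\xi_n^{(2)})}{\partial x_2} \end{bmatrix} \begin{bmatrix} \alpha_1 - x_{1,n} \\ \alpha_2 - x_{2,n} \end{bmatrix} \tag{2.10.8} $$

Let $G_n$ denote the matrix in (2.10.8). Then we can rewrite this equation as 

$$ \alpha - x_{n+1} = G_n (\alpha - x_n) \tag{2.10.9} $$

It is convenient to introduce the Jacobian matrix for the functions $g_1$ and $g_2$: 

$$ G(x) = \begin{bmatrix} \dfrac{\partial g_1(x)}{\partial x_1} & \dfrac{\partial g_1(x)}{\partial x_2} \\ \dfrac{\partial g_2(x)}{\partial x_1} & \dfrac{\partial g_2(x)}{\partial x_2} \end{bmatrix} \tag{2.10.10} $$

In (2.10.9), if $x_n$ is close to $\alpha$, then $G_n$ will be close to $G(\alpha)$. This will make the 
size or norm of $G(\alpha)$ crucial in analyzing the convergence in (2.10.9). The matrix 
$G(\alpha)$ plays the role of $g'(\alpha)$ in the theory of Section 2.5. To measure the size of 
the errors $\alpha - x_n$ and of the matrices $G_n$ and $G(\alpha)$, we use the vector and matrix 
norms of (1.1.16) and (1.1.19) in Chapter 1. 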


Theorem 2.9 Let D be a closed, bounded, and convex set in the plane. (We say 
D is convex if for any two points in D, the line segment joining 
them is also in D.) Assume that the components of g(x) are 
continuously differentiable at all points of D, and further assume 

1. $g(D) \subset D$ (2.10.11) 

2. $\lambda \equiv \max_{x \in D} \|G(x)\|_\infty < 1$ (2.10.12) 

Then 

(a) x = g(x) has a unique solution $\alpha \in D$. 

(b) For any initial point $x_0 \in D$, the iteration (2.10.5) will 
converge in D to $\alpha$. 

(c) $\|\alpha - x_{n+1}\|_\infty \le \left[ \|G(\alpha)\|_\infty + \epsilon_n \right] \|\alpha - x_n\|_\infty$ (2.10.13) 

with $\epsilon_n \to 0$ as $n \to \infty$. 

Proof (a) The existence of a fixed point $\alpha$ can be shown by proving that the 
sequence of iterates $\{x_n\}$ from (2.10.5) is convergent in D. We leave 
that to a problem, and instead just show the uniqueness of $\alpha$. 
Suppose $\alpha$ and $\beta$ are both fixed points of g(x) in D. Then 

$$ \alpha - \beta = g(\alpha) - g(\beta) \tag{2.10.14} $$

Apply the mean value theorem to component i, obtaining 

$$ g_i(\alpha) - g_i(\beta) = \nabla g_i(\xi^{(i)}) (\alpha - \beta), \qquad i = 1, 2 \tag{2.10.15} $$

with 

$$ \nabla g_i(x) = \left[ \frac{\partial g_i}{\partial x_1}, \frac{\partial g_i}{\partial x_2} \right] $$

and $\xi^{(i)} \in D$, on the line segment joining $\alpha$ and $\beta$. Since $\|G(x)\|_\infty \le \lambda < 1$, 
we have from the definition of the norm that 

$$ \left| \frac{\partial g_i(x)}{\partial x_1} \right| + \left| \frac{\partial g_i(x)}{\partial x_2} \right| \le \lambda < 1, \qquad x \in D, \quad i = 1, 2 $$

Combining this with (2.10.15), 

$$ |g_i(\alpha) - g_i(\beta)| \le \lambda \|\alpha - \beta\|_\infty $$

$$ \|g(\alpha) - g(\beta)\|_\infty \le \lambda \|\alpha - \beta\|_\infty \tag{2.10.16} $$

Combined with (2.10.14), this yields 

$$ \|\alpha - \beta\|_\infty \le \lambda \|\alpha - \beta\|_\infty $$

which is possible only if $\alpha = \beta$, showing the uniqueness of $\alpha$ in D. 

(b) Condition (2.10.11) will ensure that all $x_n \in D$ if $x_0 \in D$. Next 
subtract $x_{n+1} = g(x_n)$ from $\alpha = g(\alpha)$, obtaining 

$$ \alpha - x_{n+1} = g(\alpha) - g(x_n) $$

The result (2.10.16) applies to any two points in D. Applying this, 

$$ \|\alpha - x_{n+1}\|_\infty \le \lambda \|\alpha - x_n\|_\infty \tag{2.10.17} $$

Inductively, 

$$ \|\alpha - x_n\|_\infty \le \lambda^n \|\alpha - x_0\|_\infty \tag{2.10.18} $$

Since $\lambda < 1$, this shows $x_n \to \alpha$ as $n \to \infty$. 

(c) From (2.10.9) and using (1.1.21), 

$$ \|\alpha - x_{n+1}\|_\infty \le \|G_n\|_\infty \|\alpha - x_n\|_\infty \tag{2.10.19} $$

As $n \to \infty$, the points $\xi_n^{(i)}$ used in evaluating $G_n$ will all tend to $\alpha$, since 
they are on the line segment joining $x_n$ and $\alpha$. Then $\|G_n\|_\infty \to \|G(\alpha)\|_\infty$ 
as $n \to \infty$. Result (2.10.13) follows from (2.10.19) by letting 
$\epsilon_n = \|G_n\|_\infty - \|G(\alpha)\|_\infty$. 

The preceding theorem is the generalization to two variables of Theorem 2.6 
for functions of one variable. The following generalizes Theorem 2.7. 


Corollary 2.10 Let $\alpha$ be a fixed point of g(x), and assume components of g(x) 
are continuously differentiable in some neighborhood about $\alpha$. 
Further assume 

$$ \|G(\alpha)\|_\infty < 1 \tag{2.10.20} $$

Then for $x_0$ chosen sufficiently close to $\alpha$, the iteration $x_{n+1} = g(x_n)$ 
will converge to $\alpha$, and the results of Theorem 2.9 will be 
valid on some closed, bounded, convex region about $\alpha$. 

We leave the proof of this as a problem. Based on results in Chapter 7, the 
linear convergence of $x_n$ to $\alpha$ will still be true if all eigenvalues of $G(\alpha)$ are less 
than one in magnitude, which can be shown to be a weaker assumption than 
(2.10.20). 


Example Continue the earlier example (2.10.7). It is straightforward to compute 

$$ G(\alpha) \doteq \begin{bmatrix} .038920 & .000401 \\ .008529 & -.006613 \end{bmatrix} $$

and therefore 

$$ \|G(\alpha)\|_\infty \doteq .0393 $$

Thus the condition (2.10.20) of the theorem is satisfied. From (2.10.13), it will be 
approximately true that 

$$ \frac{\|\alpha - x_{n+1}\|_\infty}{\|\alpha - x_n\|_\infty} \le \|G(\alpha)\|_\infty \doteq .0393 $$

for all sufficiently large n. 
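
The Jacobian in this example can be checked numerically. Since (2.10.7) has the form x - M f(x) with a constant matrix M, its Jacobian is I - M F(x), where F is the Jacobian of the system (2.10.6). The sketch below evaluates this at the last iterate of Table 2.19, used here as a stand-in for $\alpha$; the function names are assumptions of the illustration.

```python
import numpy as np

M = np.array([[0.016, -0.17],
              [0.52, -0.26]])          # matrix from (2.10.7)


def F(x):
    """Jacobian of f1 = 3 x1^2 + 4 x2^2 - 1, f2 = x2^3 - 8 x1^3 - 1."""
    x1, x2 = x
    return np.array([[6 * x1, 8 * x2],
                     [-24 * x1**2, 3 * x2**2]])


def G(x):
    """Jacobian of g(x) = x - M f(x)."""
    return np.eye(2) - M @ F(x)


alpha = np.array([-0.497251208, 0.254078592])   # iterate n = 4 of Table 2.19
G_alpha = G(alpha)
row_norm = np.linalg.norm(G_alpha, np.inf)      # maximum absolute row sum
print(G_alpha)
print(row_norm)
```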


Suppose that A is a constant nonsingular matrix of order 2 x 2. We can then 
reformulate (2.10.1) as 

$$ x = x + A f(x) \equiv g(x) \tag{2.10.21} $$

The example (2.10.7) illustrates this procedure. To see the requirements on A, we 
produce the Jacobian matrix. Easily, 

$$ G(x) = I + A F(x) $$

where F(x) is the Jacobian matrix of $f_1$ and $f_2$, 

$$ F(x) = \begin{bmatrix} \dfrac{\partial f_1(x)}{\partial x_1} & \dfrac{\partial f_1(x)}{\partial x_2} \\ \dfrac{\partial f_2(x)}{\partial x_1} & \dfrac{\partial f_2(x)}{\partial x_2} \end{bmatrix} \tag{2.10.22} $$

We want to choose A so that (2.10.20) is satisfied. And for rapid convergence, we 
want $\|G(\alpha)\|_\infty \doteq 0$, or 

$$ A \doteq -F(\alpha)^{-1} $$

The matrix in (2.10.7) was chosen in this way using 

$$ A = -F(x_0)^{-1} $$

This suggests using a continual updating of A, say $A = -F(x_n)^{-1}$. The resulting 
method is 

$$ x_{n+1} = x_n - F(x_n)^{-1} f(x_n), \qquad n \ge 0 \tag{2.10.23} $$

We consider this method in the next section. 


2.11 Newton’s Method for Nonlinear Systems 


As with Newton’s method for a single equation, there is more than one way of 
viewing and deriving the Newton method for solving a system of nonlinear 
equations. We begin with an analytic derivation, and then we give a geometric 
perspective. 

Apply Taylor’s theorem for functions of two variables to each of the equations 
$f_i(x_1, x_2) = 0$, expanding $f_i(\alpha)$ about $x_0$: for i = 1, 2 

$$ \begin{aligned} 0 = f_i(\alpha) = {} & f_i(x_0) + (\alpha_1 - x_{1,0}) \frac{\partial f_i(x_0)}{\partial x_1} + (\alpha_2 - x_{2,0}) \frac{\partial f_i(x_0)}{\partial x_2} \\ & + \frac{1}{2} \left[ (\alpha_1 - x_{1,0}) \frac{\partial}{\partial x_1} + (\alpha_2 - x_{2,0}) \frac{\partial}{\partial x_2} \right]^2 f_i(\xi^{(i)}) \end{aligned} \tag{2.11.1} $$

with $\xi^{(i)}$ on the line segment joining $x_0$ and $\alpha$. If we drop the second-order 
terms, we obtain the approximation 

$$ 0 \doteq f_1(x_0) + (\alpha_1 - x_{1,0}) \frac{\partial f_1(x_0)}{\partial x_1} + (\alpha_2 - x_{2,0}) \frac{\partial f_1(x_0)}{\partial x_2} $$

$$ 0 \doteq f_2(x_0) + (\alpha_1 - x_{1,0}) \frac{\partial f_2(x_0)}{\partial x_1} + (\alpha_2 - x_{2,0}) \frac{\partial f_2(x_0)}{\partial x_2} \tag{2.11.2} $$

In matrix form, 

$$ 0 \doteq f(x_0) + F(x_0)(\alpha - x_0) \tag{2.11.3} $$

with $F(x_0)$ the Jacobian matrix of f, given in (2.10.22). 
Solving for $\alpha$, 

$$ \alpha \doteq x_0 - F(x_0)^{-1} f(x_0) \equiv x_1 $$

The approximation $x_1$ should be an improvement on $x_0$, provided $x_0$ is chosen 
sufficiently close to $\alpha$. This leads to the iteration method first obtained at the end 
of the last section, 

$$ x_{n+1} = x_n - F(x_n)^{-1} f(x_n), \qquad n \ge 0 \tag{2.11.4} $$

This is Newton’s method for solving the nonlinear system f(x) = 0. 
In actual practice, we do not invert $F(x_n)$, particularly for systems of more 
than two equations. Instead we solve a linear system for a correction term to $x_n$: 

$$ F(x_n) \delta_{n+1} = -f(x_n) $$

$$ x_{n+1} = x_n + \delta_{n+1} \tag{2.11.5} $$

This is more efficient in computation time, requiring only about one-third as 
many operations as inverting $F(x_n)$. See Sections 8.1 and 8.2 for a discussion of 
the numerical solution of linear systems of equations. 

There is a geometrical derivation for Newton’s method, in analogy with the 
tangent line approximation used with single nonlinear equations in Section 2.2. 
The graph in space of the equation 

$$ z = f_i(x_0) + (x_1 - x_{1,0}) \frac{\partial f_i(x_0)}{\partial x_1} + (x_2 - x_{2,0}) \frac{\partial f_i(x_0)}{\partial x_2} \equiv p_i(x_1, x_2) $$

is a plane that is tangent to the graph of $z = f_i(x_1, x_2)$ at the point $x_0$, for 
i = 1, 2. If $x_0$ is near $\alpha$, then these tangent planes should be good approximations 
to the associated surfaces of $z = f_i(x_1, x_2)$, for $x = (x_1, x_2)$ near $\alpha$. Then the 
intersection of the zero curves of the tangent planes $z = p_i(x_1, x_2)$ should be a 
good approximation to the corresponding intersection $\alpha$ of the zero curves of the 
original surfaces $z = f_i(x_1, x_2)$. This results in the statement (2.11.2). The 
intersection of the zero curves of $z = p_i(x_1, x_2)$, i = 1, 2, is the point $x_1$. 


Example Consider the system 

$$ f_1 \equiv 4x_1^2 + x_2^2 - 4 = 0 \qquad f_2 \equiv x_1 + x_2 - \sin(x_1 - x_2) = 0 $$

There are only two roots, one near (1, 0) and its reflection through the origin near 
(-1, 0). Using (2.11.4) with $x_0 = (1, 0)$, we obtain the results given in Table 2.20. 

Table 2.20 Example of Newton’s method 

n    $x_{1,n}$          $x_{2,n}$           $f_1(x_n)$     $f_2(x_n)$ 
0    1.0              0.0              0.0          1.59E-1 
1    1.0              -.1029207154     1.06E-2      4.55E-3 
2    .9986087598      -.1055307239     1.46E-5      6.63E-7 
3    .9986069441      -.1055304923     1.32E-11     1.87E-12 
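
The computation behind Table 2.20 can be sketched as follows, using the form (2.11.5) in which a linear system is solved at each step. The function names and the stopping test on the size of the correction are assumptions of this illustration; `numpy.linalg.solve` stands in for the linear system solver.

```python
import numpy as np


def f(x):
    """The example system of this section."""
    x1, x2 = x
    return np.array([4 * x1**2 + x2**2 - 4,
                     x1 + x2 - np.sin(x1 - x2)])


def F(x):
    """Jacobian matrix of f, as in (2.10.22)."""
    x1, x2 = x
    c = np.cos(x1 - x2)
    return np.array([[8 * x1, 2 * x2],
                     [1 - c, 1 + c]])


def newton_system(f, F, x0, tol=1e-12, itmax=20):
    """Newton's method (2.11.5): solve F(x_n) delta = -f(x_n), then update."""
    x = np.array(x0, dtype=float)
    for _ in range(itmax):
        delta = np.linalg.solve(F(x), -f(x))
        x = x + delta
        if np.linalg.norm(delta, np.inf) <= tol:
            break
    return x


root = newton_system(f, F, [1.0, 0.0])
print(root, f(root))
```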

Convergence analysis For the convergence analysis of Newton’s method (2.11.4), 
regard it as a fixed-point iteration method with 

$$ g(x) = x - F(x)^{-1} f(x) \tag{2.11.6} $$

Also assume 

$$ \text{Determinant } F(\alpha) \neq 0 $$

which is the analogue of assuming $\alpha$ is a simple root when dealing with a single 
equation, as in Theorem 2.1. It can then be shown that the Jacobian G(x) of 
(2.11.6) is zero at $x = \alpha$ (see Problem 53); consequently, the condition (2.10.20) is 
easily satisfied. 

Corollary 2.10 then implies that $x_n$ converges to $\alpha$, provided $x_0$ is chosen 
sufficiently close to $\alpha$. In addition, it can be shown that the iteration is quadratic. 
Specifically, the formulas (2.11.1) and (2.11.4) can be combined to obtain 

$$ \|\alpha - x_{n+1}\|_\infty \le B \|\alpha - x_n\|_\infty^2, \qquad n \ge 0 \tag{2.11.7} $$

for some constant B > 0. 


Variations of Newton’s method Newton’s method has both advantages and 
disadvantages when compared with other methods for solving systems of nonlin- 
ear equations. Among its advantages, it is very simple in form and there is great 
flexibility in using it on a large variety of problems. If we do not want to bother 
supplying partial derivatives to be evaluated by a computer program, we can use 
a difference approximation. For example, we commonly use 

$$ \frac{\partial f_i(x_1, x_2)}{\partial x_1} \doteq \frac{f_i(x_1 + \epsilon, x_2) - f_i(x_1, x_2)}{\epsilon} \tag{2.11.8} $$

with some very small number $\epsilon$. For a detailed discussion of the choice of $\epsilon$, see 
Dennis and Schnabel (1983, pp. 94-99). 
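
A forward-difference Jacobian along the lines of (2.11.8) might be coded as below. The function name `fd_jacobian` and the default step 1e-7 are assumptions of this sketch; the text defers the choice of $\epsilon$ to the cited reference. The example compares the approximation with the exact Jacobian of the system (2.10.6).

```python
import numpy as np


def fd_jacobian(f, x, eps=1e-7):
    """Approximate the Jacobian of f at x, one column per variable, as in (2.11.8)."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x), dtype=float)
    J = np.empty((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps                              # perturb only variable j
        J[:, j] = (np.asarray(f(xp), dtype=float) - fx) / eps
    return J


def f(x):
    """The system (2.10.6), used here only as a test function."""
    x1, x2 = x
    return np.array([3 * x1**2 + 4 * x2**2 - 1,
                     x2**3 - 8 * x1**3 - 1])


J = fd_jacobian(f, [-0.5, 0.25])
print(J)      # compare with the exact Jacobian [[-3, 2], [-6, 0.1875]]
```

Each call costs m + 1 evaluations of f for m variables, which is the source of the function-evaluation count for Newton's method quoted in the discussion that follows.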

The first disadvantage of Newton’s method is that there are other methods 
which are (1) less expensive to use, and/or (2) easier to use for some special 
classes of problems. For a system of m nonlinear equations in m unknowns, each 
iterate for Newton’s method requires $m^2 + m$ function evaluations in general. In 
addition, Newton’s method requires the solution of a system of m linear 
equations for each iterate, at a cost of about $\frac{2}{3} m^3$ arithmetic operations per linear 
system. There are other methods that are as fast or almost as fast in their 
mathematical speed of convergence, but that require fewer function evaluations 
and arithmetic operations per iteration. These are often referred to as Newton-like, 
quasi-Newton, and modified Newton methods. For a general presentation of many 
of these methods, see Dennis and Schnabel (1983). 

A simple modification of Newton’s method is to fix the Jacobian matrix for 
several steps, say k: 

$$ x_{rk+j+1} = x_{rk+j} - F(x_{rk})^{-1} f(x_{rk+j}), \qquad j = 0, 1, \ldots, k-1 \tag{2.11.9} $$

for r = 0, 1, 2, .... This means the linear system in 

$$ F(x_{rk}) \delta_{rk+j+1} = -f(x_{rk+j}) $$

$$ x_{rk+j+1} = x_{rk+j} + \delta_{rk+j+1}, \qquad j = 0, 1, \ldots, k-1 \tag{2.11.10} $$

can be solved much more efficiently than in the original Newton method (2.11.5). 
The linear system, of order m, requires about $\frac{2}{3} m^3$ arithmetic operations for its 
solution in the first case, when j = 0. But each subsequent case, j = 1, ..., k - 1, 
will require only $2m^2$ arithmetic operations for its solution. See Section 8.1 for 
more complete details. The speed of convergence of (2.11.9) will be slower than 
the original method (2.11.4), but the actual computation time of the modified 
method will often be much less. For a more detailed examination of this 
question, see Potra and Ptak (1984, p. 119). 
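
The modification (2.11.9) changes only one line of a Newton code: the Jacobian is re-evaluated every k steps instead of every step. The sketch below is illustrative, with an assumed function name and test problem (the example system of this section). For brevity it calls `numpy.linalg.solve` with the stored matrix at every step; an implementation aiming at the operation counts quoted above would instead store the factorization of the Jacobian and reuse it.

```python
import numpy as np


def f(x):
    x1, x2 = x
    return np.array([4 * x1**2 + x2**2 - 4,
                     x1 + x2 - np.sin(x1 - x2)])


def F(x):
    x1, x2 = x
    c = np.cos(x1 - x2)
    return np.array([[8 * x1, 2 * x2],
                     [1 - c, 1 + c]])


def modified_newton(f, F, x0, k=3, tol=1e-12, itmax=60):
    """Iteration (2.11.9): the Jacobian is held fixed for k consecutive steps."""
    x = np.array(x0, dtype=float)
    for n in range(itmax):
        if n % k == 0:
            J = F(x)                          # refresh only when n = r * k
        delta = np.linalg.solve(J, -f(x))     # system (2.11.10) with the stored J
        x = x + delta
        if np.linalg.norm(delta, np.inf) <= tol:
            return x, n + 1
    return x, itmax


root, steps = modified_newton(f, F, [1.0, 0.0])
print(root, steps)
```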

A second problem with Newton’s method, and with many other methods, is 
that often $x_0$ must be reasonably close to $\alpha$ in order to obtain convergence. 
There are modifications of Newton’s method to force convergence for poor 
choices of $x_0$. For example, define 

$$ x_{n+1} = x_n + s d_n, \qquad d_n = -F(x_n)^{-1} f(x_n) \tag{2.11.11} $$

and choose s > 0 to minimize 

$$ \|f(x_n + s d_n)\|_2^2 = \sum_{j=1}^{m} \left[ f_j(x_n + s d_n) \right]^2 \tag{2.11.12} $$

The choice s = 1 in (2.11.11) yields Newton’s method, but it may not be the best 
choice. In some cases, s may need to be much smaller than 1, at least initially, in 
order to ensure convergence. For a more detailed discussion, see Dennis and 
Schnabel (1983, chap. 6). 

For an analysis of some current programs for solving nonlinear systems, see 
Hiebert (1982). He also discusses the difficulties in producing such software. 
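
A simple way to experiment with (2.11.11) is sketched below. Note that it does not minimize (2.11.12) exactly: as a deliberately cruder substitute, it starts with s = 1 and halves s until the sum of squares decreases. The function name, the halving rule, and the test system (each component is arctan, a standard case where the undamped method diverges from a distant starting point) are all assumptions made for this illustration.

```python
import numpy as np


def damped_newton(f, F, x0, tol=1e-12, itmax=100):
    """Newton direction from (2.11.11) with step halving on ||f||_2^2."""
    x = np.array(x0, dtype=float)
    for _ in range(itmax):
        fx = f(x)
        phi0 = fx @ fx                        # ||f(x_n)||_2^2, as in (2.11.12)
        if phi0 <= tol**2:
            break
        d = np.linalg.solve(F(x), -fx)        # Newton direction d_n
        s = 1.0
        while s > 1e-10:
            trial = f(x + s * d)
            if trial @ trial < phi0:          # accept the first s that reduces it
                break
            s *= 0.5
        x = x + s * d
    return x


def f(x):
    """Hypothetical test system: f_i(x) = arctan(x_i), with root at the origin."""
    return np.arctan(x)


def F(x):
    return np.diag(1.0 / (1.0 + x**2))


root = damped_newton(f, F, [3.0, 3.0])
print(root)
```

From the starting point (3, 3) the full Newton step overshoots badly; the halving rule settles on s = 1/4 for the first step, after which full steps are accepted.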


2.12 Unconstrained Optimization 


Optimization refers to finding the maximum or minimum of a continuous 
function $f(x_1, \ldots, x_m)$. This is an extremely important problem, lying at the 
heart of modern industrial engineering, management science, and other areas. 
This section discusses some methods and perspectives for calculating the minimum 
or maximum of a function $f(x_1, \ldots, x_m)$. No formal algorithms are given, 
since this would require too extensive a development. 

Vector notation is used in much of the presentation, to give results for a 
general number m of variables. We consider only the unconstrained optimization 
problem, in which there are no limitations on $(x_1, \ldots, x_m)$. For simplicity only, 
we also assume $f(x_1, \ldots, x_m)$ is defined for all $(x_1, \ldots, x_m)$. 

Because the behavior of a function f(x) can be quite varied, the problem must 
be further limited. A point $\alpha$ is called a strict local minimum of f if $f(x) > f(\alpha)$ 
for all x close to $\alpha$, $x \neq \alpha$. We limit ourselves to finding a strict local minimum of 
f(x). Generally an initial guess $x_0$ of $\alpha$ will be known, and f(x) will be assumed 
to be twice continuously differentiable with respect to its variables $x_1, \ldots, x_m$. 


Reformulation as a nonlinear system With the assumption of differentiability, a 
necessary condition for « to be a strict local minimum is that 


df (a) 
=0 P= Wyte 12. 
ax, i m (2.12.1) 
Thus the nonlinear system 
of (x 
ft) =0 i=1,...,m (2.12.2) 
Ox; 


i 


can be solved, and each calculated solution can be checked as to whether it is a 
local maximum, minimum, or neither. For notation, introduce the gradient vector 


of 
Ox, 
of 
OX mm 
Using this vector, the system (2.12.2) is written more compactly as 
Vi(x) =0 (2.12.3) 


To solve (2.12.3), Newton's method (2.11.4) can be used, as well as other
rootfinding methods for nonlinear systems. Using Newton's method leads to

$$x_{n+1} = x_n - H(x_n)^{-1} \nabla f(x_n), \qquad n \ge 0 \tag{2.12.4}$$

with $H(x)$ the Hessian matrix of $f$,

$$H(x)_{i,j} = \frac{\partial^2 f(x)}{\partial x_i \, \partial x_j}, \qquad 1 \le i, j \le m$$


If $\alpha$ is a strict local minimum of $f$, then Taylor's theorem (1.1.12) can
be used to show that $H(\alpha)$ is a nonsingular matrix; then $H(x)$ will be
nonsingular for $x$ close to $\alpha$. For convergence, the analysis of Newton's
method in the preceding section can be used to prove quadratic convergence of
$x_n$ to $\alpha$ provided $x_0$ is chosen sufficiently close to $\alpha$.
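As an illustration of iteration (2.12.4), the following is a minimal sketch for the two-variable case only. The function name `newton_minimize`, the tolerance, and the sample function $f(x_1, x_2) = x_1^4 + x_1^2 + x_1 x_2 + x_2^2$ are assumptions made for this example and are not taken from the text. The linear system with the Hessian is solved by Cramer's rule rather than by forming the inverse.

```python
def newton_minimize(grad, hess, x0, tol=1e-12, max_iter=50):
    """Newton's method (2.12.4) for a function of two variables.

    grad(x1, x2) returns the gradient as a pair; hess(x1, x2) returns the
    2 x 2 Hessian as ((h11, h12), (h21, h22)).
    """
    x1, x2 = x0
    for n in range(max_iter):
        g1, g2 = grad(x1, x2)
        (a, b), (c, d) = hess(x1, x2)
        det = a * d - b * c
        if det == 0.0:
            raise ZeroDivisionError("Hessian is singular at the current iterate")
        # Solve H * delta = grad f by Cramer's rule, then step x <- x - delta.
        d1 = (d * g1 - b * g2) / det
        d2 = (a * g2 - c * g1) / det
        x1, x2 = x1 - d1, x2 - d2
        if max(abs(d1), abs(d2)) <= tol:
            return (x1, x2), n + 1
    raise RuntimeError("Newton iteration did not converge")


# Hypothetical test function: f(x1, x2) = x1**4 + x1**2 + x1*x2 + x2**2,
# which has a strict local minimum at (0, 0).
def grad_f(x1, x2):
    return (4.0 * x1**3 + 2.0 * x1 + x2, x1 + 2.0 * x2)


def hess_f(x1, x2):
    return ((12.0 * x1**2 + 2.0, 1.0), (1.0, 2.0))


minimizer, iterations = newton_minimize(grad_f, hess_f, (1.0, 1.0))
print(minimizer, iterations)
```

For this sample function the Hessian is positive definite everywhere, so the iteration is well defined from any starting point; for a general $f$ that need not hold, which is one of the drawbacks discussed next.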

The main drawbacks with the iteration (2.12.4) are the same as those given in
the last section for Newton's method for solving nonlinear systems. There are
other, more efficient optimization methods that seek to approximate $\alpha$ by
using only $f(x)$ and $\nabla f(x)$. These methods may require more iterations,
but generally their total computing time will be much less than with Newton's
method. In addition, these methods seek to obtain convergence for a larger set
of initial values $x_0$.


Descent methods  Suppose we are trying to minimize a function $f(x)$. Most
methods for doing so are based on the following general two-step iteration
process.


STEP D1:  At $x_n$, pick a direction $d_n$ such that $f(x)$ will decrease as $x$
          moves away from $x_n$ in the direction $d_n$.


STEP D2:  Let $x_{n+1} = x_n + s d_n$, with $s$ chosen to minimize

$$\varphi(s) = f(x_n + s d_n), \qquad s \ge 0 \tag{2.12.5}$$

          Usually $s$ is chosen as the smallest positive relative minimum of
          $\varphi(s)$.


Such methods are called descent methods. With each iteration,

$$f(x_{n+1}) < f(x_n)$$
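The two-step process can be sketched as follows, using the steepest descent direction $d_n = -\nabla f(x_n)$ for STEP D1. The line search is an assumption of this sketch: it brackets a minimizer of $\varphi(s)$ by doubling and then applies a golden-section search, which presumes $\varphi$ is unimodal on the bracket. The test function $f(x_1, x_2) = x_1^2 + 10 x_2^2$ and all names are hypothetical.

```python
import math


def golden_section(phi, a, b, tol=1e-10):
    """Approximate a minimizer of phi on [a, b], assuming phi is unimodal there."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - r * (b - a)
    d = a + r * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c              # minimizer lies in [a, d]
            c = b - r * (b - a)
        else:
            a, c = c, d              # minimizer lies in [c, b]
            d = a + r * (b - a)
    return 0.5 * (a + b)


def steepest_descent(f, grad, x0, tol=1e-6, max_iter=500):
    """Descent iteration with d_n = -grad f(x_n) and a one-dimensional search in s."""
    x = list(x0)
    for n in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            return x, n
        d = [-gi for gi in g]                                  # STEP D1

        def phi(s):
            return f([xi + s * di for xi, di in zip(x, d)])

        s_hi = 1.0
        while phi(2.0 * s_hi) < phi(s_hi):                     # enlarge the bracket
            s_hi *= 2.0
        s = golden_section(phi, 0.0, 2.0 * s_hi)               # STEP D2
        x = [xi + s * di for xi, di in zip(x, d)]
    raise RuntimeError("descent iteration did not converge")


# Hypothetical test function with minimum at (0, 0).
def f(x):
    return x[0] ** 2 + 10.0 * x[1] ** 2


def grad_f(x):
    return [2.0 * x[0], 20.0 * x[1]]


x_min, n_steps = steepest_descent(f, grad_f, [10.0, 1.0])
print(x_min, n_steps)
```

The number of steps needed here is far larger than for Newton's method on a comparable problem, although each step uses only $f$ and $\nabla f$.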


Descent methods are guaranteed to converge under more general conditions
than for Newton's method (2.12.4). Consider the level surface

$$C = \{ x \mid f(x) = f(x_0) \}$$

and consider only the connected portion of it, say $C'$, that contains $x_0$.
Then if $C'$ is bounded and contains $\alpha$ in its interior, descent methods
will converge under very general conditions. This is illustrated for the
two-variable case in Figure 2.8. Several level curves $f(x_1, x_2) = c$ are
shown for a set of values $c$ approaching $f(\alpha)$. The vectors $d_n$ are
directions in which $f(x)$ is decreasing.

There are a number of ways for choosing the directions $d_n$, and the best
known are as follows.


1. The method of steepest descent. Here $d_n = -\nabla f(x_n)$. It is the
   direction in which $f(x)$ decreases most rapidly when moving away from
   $x_n$. It is a good strategy near $x_n$, but it usually turns out to be a
   poor strategy for rapid convergence to $\alpha$.

2. Quasi-Newton methods. These methods can be viewed as approximations
   of Newton's method (2.12.4). They use easily computable approximations of
   $H(x_n)$ or $H(x_n)^{-1}$, and they are also descent methods. The best known
   examples are the Davidon-Fletcher-Powell method and the Broyden
   methods.

3. The conjugate gradient method. This uses a generalization of the idea of an
   orthogonal basis for a vector space to generate the directions $d_n$, with
   the directions related in an optimal way to the function $f(x)$ being
   minimized. In Chapter 8, the conjugate gradient method is used for solving
   systems of linear equations.


Figure 2.8  Illustration of steepest descent method.


There are many other approaches to minimizing a function, but they are too
numerous to include here. As general references to the preceding ideas, see
Dennis and Schnabel (1983), Fletcher (1980), Gill et al. (1981), and Luenberger
(1984). An important and very different approach to minimizing a function is the
simplex method given in Nelder and Mead (1965), with a discussion given in Gill
et al. (1981, p. 94) and Woods (1985, Chap. 2). This method uses only function
values (no derivative values), and it seems to be especially suitable for noisy
functions.

An important project to develop programs for solving optimization problems
and nonlinear systems is under way at the Argonne National Laboratory. The
program package is called MINPACK, and version 1 is available [see Moré et al.
(1980) and Moré et al. (1984)]. It contains routines for nonlinear systems and
nonlinear least squares problems. Future versions are intended to include
programs for both unconstrained and constrained optimization problems.


Discussion of the Literature 


There is a large literature on methods for calculating the roots of a single
equation. See the books by Householder (1970), Ostrowski (1973), and Traub
(1964) for a more extensive development than has been given here. Newton's
method is one of the most widely used methods, and its development is due to
many people. For an historical account of contributions to it by Newton,
Raphson, and Cauchy, see Goldstine (1977, pp. 64, 278).

For computer programs, most people still use and individually program a
method that is especially suitable for their particular application. However, one
should strongly consider using one of the general-purpose programs that have
been developed in recent years and that are available in the commercial software
libraries. They are usually accurate, efficient, and easy to use. Among such
general-purpose programs, the ones based on Brent (1973) and Dekker (1969) have
been most popular, and further developments of them continue to be made, as in
Le (1985). The IMSL and NAG computer libraries include these and other
excellent rootfinding programs.

Finding the roots of polynomials is an extremely old area, going back to at
least the ancient Greeks. There are many methods and a large literature for them,
and many new methods have been developed in the past 2 to 3 decades. As an
introduction to the area, see Dejon and Henrici (1969), Henrici (1974, chap. 6),
Householder (1970), Traub (1964), and their bibliographies. The article by
Wilkinson (1984) shows some of the practical difficulties of solving the
polynomial rootfinding problem on a computer. Accurate, efficient, automatic, and
reliable computer programs have been produced for finding the roots of
polynomials. Among such programs are (a) those of Jenkins (1975), Jenkins and
Traub (1970), (1972), and (b) the program ZERPOL of Smith (1967), based on
Laguerre's method [see Kahan (1967), Householder (1970, p. 176)]. These
automatic programs are much too sophisticated, both mathematically and
algorithmically, to discuss in an introductory text such as this one. Nonetheless,
they are well worth using. Most people would not be able to write a program that
would be as competitive in both speed and accuracy. The latter is especially
important, since the polynomial rootfinding problem can be very sensitive to
rounding errors, as was shown in examples earlier in the chapter.

The study of numerical methods for solving systems of nonlinear equations
and optimization problems is currently a very popular area of research. For
introductions to numerical methods for solving nonlinear systems, see Baker and
Phillips (1981, pt. 1), Ortega and Rheinboldt (1970), and Rheinboldt (1974). For
generalizations of these methods to nonlinear differential and integral equations,
see Baker and Phillips (1981), Kantorovich (1948) [a classical paper in this area],
Kantorovich and Akilov (1964), and Rall (1969). For a survey of numerical
methods for optimization, see Dennis (1984) and Powell (1982). General
introductions are given in Dennis and Schnabel (1983), Fletcher (1980), (1981), Gill
et al. (1981), and Luenberger (1984). As an example of recent research in
optimization theory and in the development of software, see Boggs et al. (1985).
For computer programs, see Hiebert (1982) and Moré et al. (1984).


Problems


1. The introductory example for $f(x) = a - (1/x)$ is related to the infinite
   product

$$\prod_{j=0}^{\infty} \left(1 + r^{2^j}\right) = \operatorname*{Limit}_{n \to \infty} \left[ (1 + r)(1 + r^2)(1 + r^4) \cdots \left(1 + r^{2^n}\right) \right]$$

   By using formulas (2.0.6) and (2.0.9), we can calculate the value of the
   infinite product. What is this value, and what condition on $r$ is required for
   the infinite product to converge? Hint: Let $r = r_0$, and write $x_n$ in
   terms of $x_0$ and $r_0$.


2. Write a program implementing the algorithm Bisect given in Section 2.1. 


Use the program to calculate the real roots of the following equations. Use 
an error tolerance of e = 107°. 


(a) e*—3x*=0 (b) x? =x?4+x41 (ec) ex= 


1+ x? 


(d) x= 1+ 3cos(x) 


3. Use the program from Problem 2 to calculate (a) the smallest positive root 
of x — tan(x) = 0, and (b) the root of this equation that is closest to 
x = 100. 


4. Implement the algorithm Newton given in Section 2.2. Use it to solve the 
equations in Problem 2. 


5. Use Newton’s method to calculate the roots requested in Problem 3. 
Attempt to explain the differences in finding the roots of parts (a) and (b). 


6. Use Newton's method to calculate the unique root of

$$x + e^{-Bx^2} \cos(x) = 0$$

   with $B > 0$ a parameter to be set. Use a variety of increasing values of $B$,
   for example, $B = 1, 5, 10, 25, 50$. Among the choices of $x_0$ used, choose
   $x_0 = 0$ and explain any anomalous behavior. Theoretically, the Newton
   method will converge for any value of $x_0$ and $B$. Compare this with actual
   computations for larger values of $B$.


7. An interesting polynomial rootfinding problem occurs in the computation
   of annuities. An amount of $P_1$ dollars is put into an account at the
   beginning of years $1, 2, \ldots, N_1$. It is compounded annually at a rate of
   $r$ (e.g., $r = .05$ means a 5 percent rate of interest). At the beginning of
   years $N_1 + 1, \ldots, N_1 + N_2$, a payment of $P_2$ dollars is removed from
   the account. After the last payment, the account is exactly zero. The
   relationship of the variables is

$$P_1 \left[ (1 + r)^{N_1} - 1 \right] = P_2 \left[ 1 - (1 + r)^{-N_2} \right]$$

   If $N_1 = 30$, $N_2 = 20$, $P_1 = 2000$, $P_2 = 8000$, then what is $r$? Use a
   rootfinding method of your choice.


8. Use the Newton—Fourier method to solve the equations in Problems 2 
and 6. 


9. Use the secant method to solve the equations given in Problem 2.


10. Use the secant method to solve the equation of Problem 6.


11. Show the error formula (2.3.2) for the secant method,

$$\alpha - x_{n+1} = -(\alpha - x_{n-1})(\alpha - x_n) \, \frac{f[x_{n-1}, x_n, \alpha]}{f[x_{n-1}, x_n]}$$


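12. Consider Newton's method for finding the positive square root of $a > 0$.
    Derive the following results, assuming $x_0 > 0$, $x_0 \ne \sqrt{a}$.

    (a) $x_{n+1} = \dfrac{1}{2} \left( x_n + \dfrac{a}{x_n} \right)$

    (b) $x_{n+1}^2 - a = \left[ \dfrac{x_n^2 - a}{2 x_n} \right]^2$, $n \ge 0$, and thus
        $x_n > \sqrt{a}$ for all $n > 0$.

    (c) The iterates $\{x_n\}$ are a strictly decreasing sequence for $n \ge 1$.
        Hint: Consider the sign of $x_{n+1} - x_n$.

    (d) $e_{n+1} = -e_n^2 / (2 x_n)$, with $e_n = \sqrt{a} - x_n$,

$$\mathrm{Rel}(x_{n+1}) = -\frac{\sqrt{a}}{2 x_n} \left[ \mathrm{Rel}(x_n) \right]^2, \qquad n \ge 0$$

        with $\mathrm{Rel}(x_n)$ the relative error in $x_n$.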
(e) If x9 > va and |Rel(x,)| < 0.1, bound Rel(x,). 


13. Newton's method is the commonly used method for calculating square
    roots on a computer. To use Newton's method to calculate $\sqrt{a}$, an
    initial guess $x_0$ must be chosen, and it would be most convenient to use a
    fixed number of iterates rather than having to test for convergence. For
    definiteness, suppose that the computer arithmetic is binary and that the
    mantissa contains 48 binary bits. Write

$$a = \hat{a} \cdot 2^e, \qquad \tfrac{1}{2} \le \hat{a} < 1$$

    This can be easily modified to the form

$$a = b \cdot 2^f, \qquad \tfrac{1}{4} \le b < 1$$

    with $f$ an even integer. Then

$$\sqrt{a} = \sqrt{b} \cdot 2^{f/2}, \qquad \tfrac{1}{2} \le \sqrt{b} < 1$$

    and the number $\sqrt{a}$ will be in standard floating-point form, once
    $\sqrt{b}$ is known.

    This reduces the problem to that of calculating $\sqrt{b}$ for
    $\tfrac{1}{4} \le b < 1$. Use the linear interpolating formula

$$x_0 = \tfrac{1}{3}(2b + 1), \qquad \tfrac{1}{4} \le b \le 1$$

    as an initial guess for the Newton iteration for calculating $\sqrt{b}$.
    Bound the error $\sqrt{b} - x_0$. Estimate how many iterates are necessary in
    order that

$$0 \le x_n - \sqrt{b} \le 2^{-48}$$

    which is the limit of machine precision for $b$ on a particular computer.
    [Note that the effect of rounding errors is being ignored.] How might the
    choice of $x_0$ be improved?


14. Numerically calculate the Newton iterates for solving $x^2 - 1 = 0$, and use
    $x_0 = 100{,}000$. Identify and explain the resulting speed of convergence.


15. (a) Apply Newton's method to the function


(Ree et 


with the root $\alpha = 0$. What is the behavior of the iterates? Do they
converge, and if so, at what rate?


(b) Do the same as in (a), but with

$$f(x) = \begin{cases} \sqrt[3]{x^2}, & x \ge 0 \\ -\sqrt[3]{x^2}, & x < 0 \end{cases}$$
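

16. A sequence $\{x_n\}$ is said to converge superlinearly to $\alpha$ if

$$|\alpha - x_{n+1}| \le c_n |\alpha - x_n|, \qquad n \ge 0$$

    with $c_n \to 0$ as $n \to \infty$. Show that in this case,

$$\operatorname*{Limit}_{n \to \infty} \frac{|\alpha - x_n|}{|x_{n+1} - x_n|} = 1$$

    Thus $|\alpha - x_n| \approx |x_{n+1} - x_n|$ is increasingly valid as
    $n \to \infty$.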


17. Newton's method for finding a root $\alpha$ of $f(x) = 0$ sometimes requires
    the initial guess $x_0$ to be quite close to $\alpha$ in order to obtain
    convergence. Verify that this is the case for the root $\alpha = \pi/2$ of

$$f(x) = \cos(x) + \sin^2(50x)$$

    Give a rough estimate of how small $|x_0 - \alpha|$ should be in order to
    obtain convergence to $\alpha$. Hint: Consider (2.2.6).


18. Write a program to implement Muller's method. Apply it to the rootfinding
    problems in Problems 2, 3, and 6.


19. Show that $x = 1 + \tan^{-1}(x)$ has a solution $\alpha$. Find an interval
    $[a, b]$ containing $\alpha$ such that for every $x_0 \in [a, b]$, the
    iteration

$$x_{n+1} = 1 + \tan^{-1}(x_n), \qquad n \ge 0$$

    will converge to $\alpha$. Calculate the first few iterates and estimate the
    rate of convergence.


20. Do the same as in Problem 19, but with the iteration


Xeay =3—2log(l +e) xn>0 


21. To find a root for $f(x) = 0$ by iteration, rewrite the equation as

$$x = x + c f(x) \equiv g(x)$$

    for some constant $c \ne 0$. If $\alpha$ is a root of $f(x)$ and if
    $f'(\alpha) \ne 0$, how should $c$ be chosen in order that the sequence
    $x_{n+1} = g(x_n)$ converges to $\alpha$?


22. Consider the equation

$$x = d + h f(x)$$

    with $d$ a given constant and $f(x)$ continuous for all $x$. For $h = 0$, a
    root is $\alpha = d$. Show that for all sufficiently small $h$, this equation
    has a root $\alpha(h)$. What condition is needed, if any, in order to ensure
    the uniqueness of the root $\alpha(h)$ in some interval about $d$?


23. The iteration $x_{n+1} = 2 - (1 + c) x_n + c x_n^3$ will converge to
    $\alpha = 1$ for some values of $c$ [provided $x_0$ is chosen sufficiently
    close to $\alpha$]. Find the values of $c$ for which this is true. For what
    value of $c$ will the convergence be quadratic?


24. Which of the following iterations will converge to the indicated fixed point
    $\alpha$ (provided $x_0$ is sufficiently close to $\alpha$)? If it does
    converge, give the order of convergence; for linear convergence, give the
    rate of linear convergence.

    (a) $x_{n+1} = -16 + 6 x_n + \dfrac{12}{x_n}$, $\quad \alpha = 2$

    (b) $x_{n+1} = \dfrac{2}{3} x_n + \dfrac{1}{x_n^2}$, $\quad \alpha = 3^{1/3}$

    (c) $x_{n+1} = \dfrac{12}{1 + x_n}$, $\quad \alpha = 3$


25. Show that

$$x_{n+1} = \frac{x_n (x_n^2 + 3a)}{3 x_n^2 + a}, \qquad n \ge 0$$

    is a third-order method for computing $\sqrt{a}$. Calculate

$$\operatorname*{Limit}_{n \to \infty} \frac{\sqrt{a} - x_{n+1}}{(\sqrt{a} - x_n)^3}$$

    assuming $x_0$ has been chosen sufficiently close to $\alpha$.


26. Using Theorem 2.8, show that formula (2.4.11) is a cubically convergent
    iteration method.


27. Define an iteration formula by

$$x_{n+1} = z_{n+1} - \frac{f(z_{n+1})}{f'(x_n)}, \qquad z_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

    Show that the order of convergence of $\{x_n\}$ to $\alpha$ is at least 3.
    Hint: Use Theorem 2.8, and let

$$g(x) = h(x) - \frac{f(h(x))}{f'(x)}, \qquad h(x) = x - \frac{f(x)}{f'(x)}$$


28. There is another modification of Newton's method, similar to the secant
    method, but using a different approximation to the derivative $f'(x_n)$.
    Define

$$x_{n+1} = x_n - \frac{f(x_n)}{D(x_n)}, \qquad D(x_n) = \frac{f(x_n + f(x_n)) - f(x_n)}{f(x_n)}, \qquad n \ge 0$$

    This one-point method is called Steffenson's method. Assuming
    $f'(\alpha) \ne 0$, show that this is a second-order method. Hint: Write the
    iteration as $x_{n+1} = g(x_n)$. Use $f(x) = (x - \alpha) h(x)$ with
    $h(\alpha) \ne 0$, and then compute the formula for $g(x)$ in terms of
    $h(x)$. Having done so, apply Theorem 2.8.


29. Given below is a table of iterates from a linearly convergent iteration
    $x_{n+1} = g(x_n)$. Estimate (a) the rate of linear convergence, (b) the fixed
    point $\alpha$, and (c) the error $\alpha - x_5$.

        n     x_n
        0     1.0949242
        1     1.2092751
        2     1.2807917
        3     1.3254943
        4     1.3534339
        5     1.3708962


30. The algorithm Aitken, given in Section 2.6, can be shown to be second
    order in its speed of convergence. Let the original iteration be
    $x_{n+1} = g(x_n)$, $n \ge 0$. The formula (2.6.8) can be rewritten in the
    equivalent form

$$\alpha \approx \hat{x}_{n-2} = x_{n-2} + \frac{(x_{n-1} - x_{n-2})^2}{(x_{n-1} - x_{n-2}) - (x_n - x_{n-1})}, \qquad n \ge 2$$

    To examine the speed of convergence of the Aitken extrapolates, we
    consider the associated sequence

$$z_{n+1} = z_n + \frac{[g(z_n) - z_n]^2}{[g(z_n) - z_n] - [g(g(z_n)) - g(z_n)]}, \qquad n \ge 0$$

    The values $z_n$ are the successive values of $\hat{x}_n$ produced in the
    algorithm Aitken.

    For $g'(\alpha) \ne 0$ or 1, show that $z_n$ converges to $\alpha$
    quadratically. This is true even if $|g'(\alpha)| > 1$ and the original
    iteration is divergent. Hint: Do not attempt to use Theorem 2.8 directly, as
    it will be too complicated. Instead write

$$g(x) = \alpha + (x - \alpha) h(x), \qquad h(\alpha) = g'(\alpha) \ne 0$$

    Use this to show that

$$\alpha - z_{n+1} = H(z_n) (\alpha - z_n)^2$$

    for some function $H(x)$ bounded about $x = \alpha$.


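31. Consider the sequence

$$x_n = \alpha + \beta \rho^n + \gamma \rho^{2n}, \qquad n \ge 0, \quad |\rho| < 1$$

    with $\beta, \gamma \ne 0$, which converges to $\alpha$ with a linear rate of
    $\rho$. Let $\hat{x}_{n-2}$ be the Aitken extrapolate:

$$\hat{x}_{n-2} = x_{n-2} + \frac{(x_{n-1} - x_{n-2})^2}{(x_{n-1} - x_{n-2}) - (x_n - x_{n-1})}, \qquad n \ge 2$$

    Show that

$$\hat{x}_{n-2} = \alpha + a \rho^{2n} + b \rho^{3n} + c_n \rho^{4n}$$

    where $c_n$ is bounded as $n \to \infty$. Derive expressions for $a$ and $b$.
    The sequence $\{\hat{x}_n\}$ converges to $\alpha$ with a linear rate of
    $\rho^2$.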


32. Let $f(x)$ have a multiple root $\alpha$, say of multiplicity $m \ge 1$.
    Show that

$$K(x) = \frac{f(x)}{f'(x)}$$

    has $\alpha$ as a simple root. Why does this not help with the fundamental
    difficulty in numerically calculating multiple roots, namely that of the large
    interval of uncertainty in $\alpha$?


33. Use Newton's method to calculate the real roots of the following
    polynomials as accurately as possible. Estimate the multiplicity of each
    root, and then if necessary, try an alternative way of improving your
    calculated values.

    (a) $x^4 - 3.2x^3 + .96x^2 + 4.608x - 3.456$

    (b) $x^5 + .9x^4 - 1.62x^3 - 1.458x^2 + .6561x + .59049$


34. Use the program from Problem 2 to solve the following equations for the
root a = 1. Use the initial interval [0,3], and in all cases use « = 10~° as 
the stopping tolerance. Compare the results with those obtained in Section 
2.8 using Brent’s algorithm. 

(i) $(x - 1)[1 + (x - 1)^2] = 0$

(ii) $x^2 - 1 = 0$

(iii) $-1 + x(3 + x(-3 + x)) = 0$

(iv) $(x - 1)\exp(-1/(x - 1)^2) = 0$


35. Prove the lower bound in (2.9.6), using the upper bound in (2.9.6) and the
    suggestion in (2.9.7).


36. Let $p(x)$ be a polynomial of degree $n$. Let its distinct roots be denoted
    by $\alpha_1, \ldots, \alpha_r$, of respective multiplicities
    $m_1, \ldots, m_r$.

    (a) Show that

$$\frac{p'(x)}{p(x)} = \sum_{j=1}^{r} \frac{m_j}{x - \alpha_j}$$

    (b) Let $c$ be a number for which $p'(c) \ne 0$. Show there exists a root
        $\alpha$ of $p(x)$ satisfying

$$|\alpha - c| \le n \left| \frac{p(c)}{p'(c)} \right|$$


37. For the polynomial

$$p(x) = a_0 + a_1 x + \cdots + a_n x^n, \qquad a_n \ne 0$$

    define

$$R = \frac{|a_0| + |a_1| + \cdots + |a_{n-1}|}{|a_n|}$$

    Show that every root $x$ of $p(x) = 0$ satisfies

$$|x| \le \operatorname{Max} \left\{ R, \sqrt[n]{R} \right\}$$


38. Write a computer program to evaluate the following polynomials $p(x)$ for
    the given values of $x$ and to evaluate the noise in the values $p(x)$. For
    each $x$, evaluate $p(x)$ in both single and double precision arithmetic; use
    their difference as the noise in the single precision value, due to rounding
    errors in the evaluation of $p(x)$. Use both the ordinary formula (2.9.1) and
    Horner's rule (2.9.8) to evaluate each polynomial; this should show that the
    noise is different in the two cases.

    (a) $p(x) = x^4 - 5.7x^3 - .47x^2 + 29.865x - 26.1602$, $\quad -3 \le x \le 5$,
        with steps of $h = 0.1$ for $x$.

    (b) $p(x) = x^4 - 5.4x^3 + 10.56x^2 - 8.954x + 2.7951$, $\quad 1 \le x \le 1.2$,
        in steps of $h = .001$ for $x$.

    Note: Smaller or larger values of $h$ may be appropriate on different
    computers. Also, before using double precision, enter the coefficients
    in single precision, for a more valid comparison.


39. Use complex arithmetic and Newton's method to calculate a complex root
    of

$$p(z) = z^4 - 3z^3 + 20z^2 + 44z + 54$$

    located near to $z_0 = 2.5 + 4.5i$.


40. Write a program to find the roots of the following polynomials as
    accurately as possible.

    (a) $676039x^{12} - 1939938x^{10} + 2078505x^{8} - 1021020x^{6} + 225225x^{4} - 18018x^{2} + 231$

    (b) $x^4 - 4.096152422706631x^3 + 3.284232335022705x^2 + 4.703847577293368x - 5.715767664977294$


41. Use a package rootfinding program for polynomials to find the roots of the
    polynomials in Problems 38, 39, and 40.


42. For the example $f(x) = (x - 1)(x - 2) \cdots (x - 7)$, (2.9.21) in Section
    2.9, consider perturbing the coefficient of $x^i$ by $\epsilon_i x^i$, in
    which $\epsilon_i$ is chosen so that the relative perturbation in the
    coefficient of $x^i$ is the same as that of the example in the text in the
    coefficient of $x^6$. What does the linearized theory (2.9.19) predict for the
    perturbations in the roots? A change in which coefficient will lead to the
    greatest changes in the roots?


43. The polynomial $f(x) = x^5 - 300x^2 - 126x + 5005$ has $\alpha = 5$ as a
    root. Estimate the effect on $\alpha$ of changing the coefficient of $x^5$
    from 1 to $1 + \epsilon$.


44. The stability result (2.9.19) for polynomial roots can be generalized to
    general functions. Let $\alpha$ be a simple root of $f(x) = 0$, and let
    $f(x)$ and $g(x)$ be continuously differentiable about $\alpha$. Define
    $F_\epsilon(x) = f(x) + \epsilon g(x)$. Let $\alpha(\epsilon)$ denote a root
    of $F_\epsilon(x) = 0$, corresponding to $\alpha = \alpha(0)$ for small
    $\epsilon$. To see that there exists such an $\alpha(\epsilon)$, and to prove
    that it is continuously differentiable, use the implicit function theorem for
    functions of one variable. From this, generalize (2.9.19) to the present
    situation.


45. Using the stability result in Problem 44, estimate the root
    $\alpha(\epsilon)$ of

$$x - \tan(x) + \epsilon = 0$$

    Consider two cases explicitly for roots $\alpha$ of $x - \tan(x) = 0$:
    (1) $\alpha \in (.5\pi, 1.5\pi)$, (2) $\alpha \in (31.5\pi, 32.5\pi)$.


46. Consider the system
5 ‘ 5 


SS SSS = 
L+(x+y)? d 1+(x-y) 


Find a bounded region $D$ for which the hypotheses of Theorem 2.9 are
satisfied. Hint: What will be the sign of the components of the root $\alpha$?
Also, what are the maximum possible values for $x$ and $y$ in the preceding
formulas?


47. Consider the system


- x2 


e zt 2 
ee ee ee y= .5+ h-tan “! (x? + y’) 
Show that if $h$ is chosen sufficiently small, then this system has a unique
solution $\alpha$ within some rectangular region. Moreover, show that simple
iteration of the form (2.10.4) will converge to this solution.


48. Prove Corollary 2.10. Hint: Use the continuity of the partial derivatives of
    the components of $g(x)$.


49. Prove that the iterates $\{x_n\}$ in Theorem 2.9 will converge to a solution
    of $x = g(x)$. Hint: Consider the infinite sum

$$x_0 + \sum_{n=0}^{\infty} [x_{n+1} - x_n]$$

    Its partial sums are

$$x_0 + \sum_{n=0}^{N-1} [x_{n+1} - x_n] = x_N$$

    Thus if the infinite series converges, say to $\alpha$, then $x_N$ converges
    to $\alpha$. Show the infinite series converges absolutely by showing and
    using the result

$$\| x_{n+1} - x_n \|_\infty \le \lambda \| x_n - x_{n-1} \|_\infty$$

    Following this, show that $\alpha$ is a fixed point of $g(x)$.


50. Using Newton's method for nonlinear systems, solve the nonlinear system

$$x^2 + y^2 = 4, \qquad x^2 - y^2 = 1$$

    The true solutions are easily determined to be
    $(\pm\sqrt{2.5}, \pm\sqrt{1.5})$. As an initial guess, use
    $(x_0, y_0) = (1.6, 1.2)$.


51. Solve the system

$$x^2 + x y^3 = 9, \qquad 3 x^2 y - y^3 = 4$$

    using Newton's method for nonlinear systems. Use each of the initial
    guesses $(x_0, y_0) = (1.2, 2.5), (-2, 2.5), (-1.2, -2.5), (2, -2.5)$. Observe
    to which root the method converges, the number of iterates required,
    and the speed of convergence.


52. Using Newton's method for nonlinear systems, solve for all roots of the
    following nonlinear systems. Use graphs to estimate initial guesses.

    (a) $x^2 + y^2 - 2x - 2y + 1 = 0, \qquad x + y - 2xy = 0$

    (b) $x^2 + 2xy + y^2 - x + y - 4 = 0$
        $5x^2 - 6xy + 5y^2 + 16x - 16y + 12 = 0$


53. Prove that the Jacobian of

$$g(x) = x - F(x)^{-1} f(x)$$

    is zero at any root $\alpha$ of $f(x) = 0$, provided $F(\alpha)$ is
    nonsingular. Combined with Corollary 2.10 of Section 2.10, this will prove the
    convergence of Newton's method.


54. Use Newton's method (2.12.4) to find the minimum value of the function

$$f(x) = x_1^4 + x_1 x_2 + (1 + x_2)^2$$

    Experiment with various initial guesses and observe the pattern of
    convergence.


THREE 


INTERPOLATION THEORY 


The concept of interpolation is the selection of a function p(x) from a given
class of functions in such a way that the graph of y = p(x) passes through a
finite set of given data points. In most of this chapter we limit the interpolating
function p(x) to being a polynomial.

Polynomial interpolation theory has a number of important uses. In this text,
its primary use is to furnish some mathematical tools that are used in developing
methods in the areas of approximation theory, numerical integration, and the
numerical solution of differential equations. A second use is in developing means
for working with functions that are stored in tabular form. For example, almost
everyone is familiar from high school algebra with linear interpolation in a table
of logarithms. We derive computationally convenient forms for polynomial
interpolation with tabular data and analyze the resulting error. It is recognized
that with the widespread use of calculators and computers, there is far less use
for table interpolation than in the recent past. We have included it because the
resulting formulas are still useful in other connections and because table
interpolation provides us with convenient examples and exercises.

The chapter concludes with introductions to two other topics. These are (1) 
piecewise polynomial interpolating functions, spline functions in particular, and 
(2) interpolation with trigonometric functions. 


3.1 Polynomial Interpolation Theory 


Let $x_0, x_1, \ldots, x_n$ be distinct real or complex numbers, and let
$y_0, y_1, \ldots, y_n$ be associated function values. We now study the problem
of finding a polynomial $p(x)$ that interpolates the given data:

$$p(x_i) = y_i, \qquad i = 0, 1, \ldots, n \tag{3.1.1}$$

Does such a polynomial exist, and if so, what is its degree? Is it unique? What
is a formula for producing $p(x)$ from the given data?
By writing

$$p(x) = a_0 + a_1 x + \cdots + a_m x^m$$

for a general polynomial of degree $m$, we see there are $m + 1$ independent
parameters $a_0, a_1, \ldots, a_m$. Since (3.1.1) imposes $n + 1$ conditions on
$p(x)$, it is reasonable to first consider the case when $m = n$. Then we want
to find $a_0, a_1, \ldots, a_n$ such that

$$\begin{aligned}
a_0 + a_1 x_0 + a_2 x_0^2 + \cdots + a_n x_0^n &= y_0 \\
&\;\;\vdots \\
a_0 + a_1 x_n + a_2 x_n^2 + \cdots + a_n x_n^n &= y_n
\end{aligned} \tag{3.1.2}$$


This is a system of $n + 1$ linear equations in $n + 1$ unknowns, and solving it
is completely equivalent to solving the polynomial interpolation problem. In
vector and matrix notation, the system is

$$Xa = y$$

with

$$X = \left[ x_i^j \right], \qquad i, j = 0, 1, \ldots, n \tag{3.1.3}$$

$$a = [a_0, a_1, \ldots, a_n]^T, \qquad y = [y_0, \ldots, y_n]^T$$

The matrix $X$ is called a Vandermonde matrix.
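To make the system $Xa = y$ concrete, here is a small sketch that builds the Vandermonde matrix and solves for the coefficients by Gauss-Jordan elimination in exact rational arithmetic. The function name `vandermonde_coefficients` and the sample data points are assumptions for illustration; note also that in the code `n` counts the data points, which is $n + 1$ in the notation of the text.

```python
from fractions import Fraction


def vandermonde_coefficients(xs, ys):
    """Solve X a = y, X[i][j] = xs[i]**j, returning [a_0, a_1, ..., a_{n-1}]."""
    n = len(xs)
    # Augmented matrix: row i is [1, x_i, x_i^2, ..., x_i^(n-1) | y_i].
    A = [[Fraction(x) ** j for j in range(n)] + [Fraction(y)]
         for x, y in zip(xs, ys)]
    for k in range(n):
        # Find a row with a nonzero entry in column k and move it into place.
        p = next(r for r in range(k, n) if A[r][k] != 0)
        A[k], A[p] = A[p], A[k]
        pivot = A[k][k]
        A[k] = [v / pivot for v in A[k]]
        # Eliminate column k from every other row.
        for r in range(n):
            if r != k and A[r][k] != 0:
                factor = A[r][k]
                A[r] = [vr - factor * vk for vr, vk in zip(A[r], A[k])]
    return [A[i][n] for i in range(n)]


# Hypothetical data: three points, so the interpolant has degree <= 2.
coeffs = vandermonde_coefficients([0, -1, 1], [1, 2, 3])
print(coeffs)
```

The pivot search raises `StopIteration` if no nonzero pivot exists, which can only happen when two of the `xs` coincide. Solving the Vandermonde system directly is shown here only to illustrate (3.1.2); the text goes on to derive more convenient formulas.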


Theorem 3.1  Given $n + 1$ distinct points $x_0, \ldots, x_n$ and $n + 1$
             ordinates $y_0, \ldots, y_n$, there is a polynomial $p(x)$ of degree
             $\le n$ that interpolates $y_i$ at $x_i$, $i = 0, 1, \ldots, n$. This
             polynomial $p(x)$ is unique among the set of all polynomials of
             degree at most $n$.


Proof  Three proofs of this important result are given. Each will furnish some
       needed information and has important uses in other interpolation
       problems.

(i)    It can be shown that for the matrix $X$ in (3.1.3),

$$\det(X) = \prod_{0 \le j < i \le n} (x_i - x_j) \tag{3.1.4}$$

       (see Problem 1). This shows that $\det(X) \ne 0$, since the points $x_i$
       are distinct. Thus $X$ is nonsingular and the system $Xa = y$ has a
       unique solution $a$. This proves the existence and uniqueness of an
       interpolating polynomial of degree $\le n$.

(ii)   By a standard theorem of linear algebra (see Theorem 7.2 of
       Chapter 7), the system $Xa = y$ has a unique solution if and only if
       the homogeneous system $Xb = 0$ has only the trivial solution
       $b = 0$. Therefore, assume $Xb = 0$ for some $b$. Using $b$, define

$$p(x) = b_0 + b_1 x + \cdots + b_n x^n$$

       From the system $Xb = 0$, we have

$$p(x_i) = 0, \qquad i = 0, 1, \ldots, n$$

       The polynomial $p(x)$ has $n + 1$ zeros and degree $p(x) \le n$. This
       is not possible unless $p(x) \equiv 0$. But then all coefficients
       $b_i = 0$, $i = 0, 1, \ldots, n$, completing the proof.

(iii)  We now exhibit explicitly the interpolating polynomial. To begin,
       we consider the special interpolation problem in which

$$y_i = 1, \qquad y_j = 0 \quad \text{for } j \ne i$$

       for some $i$, $0 \le i \le n$. We want a polynomial of degree $\le n$ with
       the $n$ zeros $x_j$, $j \ne i$. Then

$$p(x) = c (x - x_0) \cdots (x - x_{i-1})(x - x_{i+1}) \cdots (x - x_n)$$

       for some constant $c$. The condition $p(x_i) = 1$ implies

$$c = \left[ (x_i - x_0) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_n) \right]^{-1}$$

       This special polynomial is written as

$$l_i(x) = \prod_{j \ne i} \left( \frac{x - x_j}{x_i - x_j} \right), \qquad i = 0, 1, \ldots, n \tag{3.1.5}$$

       To solve the general interpolation problem (3.1.1), write

$$p(x) = y_0 l_0(x) + y_1 l_1(x) + \cdots + y_n l_n(x)$$

       With the special properties of the polynomials $l_i(x)$, $p(x)$ easily
       satisfies (3.1.1). Also, degree $p(x) \le n$, since all $l_i(x)$ have
       degree $n$.

       To prove uniqueness, suppose $q(x)$ is another polynomial of
       degree $\le n$ that satisfies (3.1.1). Define

$$r(x) = p(x) - q(x)$$

       Then degree $r(x) \le n$, and

$$r(x_i) = p(x_i) - q(x_i) = y_i - y_i = 0, \qquad i = 0, 1, \ldots, n$$

       Since $r(x)$ has $n + 1$ zeros, we must have $r(x) \equiv 0$. This proves
       $p(x) \equiv q(x)$, completing the proof.  ∎


Uniqueness is a property that is of practical use in much that follows. We
derive other formulas for the interpolation problem (3.1.1), and uniqueness says
they are the same polynomial. Also, without uniqueness the linear system (3.1.2)
would not be uniquely solvable; from results in linear algebra, this would imply
the existence of data vectors $y$ for which there is no interpolating polynomial
of degree $\le n$.

The formula

$$p_n(x) = \sum_{i=0}^{n} y_i l_i(x) \tag{3.1.6}$$

is called Lagrange's formula for the interpolating polynomial.
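Example

$$p_1(x) = \frac{x - x_1}{x_0 - x_1} y_0 + \frac{x - x_0}{x_1 - x_0} y_1 = \frac{(x_1 - x) y_0 + (x - x_0) y_1}{x_1 - x_0}$$

$$p_2(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)} y_0 + \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)} y_1 + \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)} y_2$$

The polynomial of degree $\le 2$ that passes through the three points $(0, 1)$,
$(-1, 2)$, and $(1, 3)$ is

$$p_2(x) = \frac{(x + 1)(x - 1)}{(0 + 1)(0 - 1)} \cdot 1 + \frac{(x - 0)(x - 1)}{(-1 - 0)(-1 - 1)} \cdot 2 + \frac{(x - 0)(x + 1)}{(1 - 0)(1 + 1)} \cdot 3 = 1 + \frac{1}{2} x + \frac{3}{2} x^2$$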
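Lagrange's formula (3.1.6) translates almost directly into code. The sketch below evaluates $p_n(x)$ from the basis polynomials $l_i(x)$ of (3.1.5) and checks it against the three-point example; the function name `lagrange_interpolate` is an assumption, and exact rational arithmetic is used only so that the comparison with $1 + \frac{1}{2}x + \frac{3}{2}x^2$ is exact.

```python
from fractions import Fraction


def lagrange_interpolate(xs, ys, x):
    """Evaluate p_n(x) = sum_i y_i * l_i(x), with l_i as in (3.1.5)."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        li = Fraction(1)
        for j, xj in enumerate(xs):
            if j != i:
                li *= Fraction(x - xj) / Fraction(xi - xj)
        total += yi * li
    return total


# Data from the example: the points (0, 1), (-1, 2), (1, 3).
xs = [0, -1, 1]
ys = [1, 2, 3]

# Compare with the closed form 1 + x/2 + 3x^2/2 at a few sample points.
for x in [Fraction(-2), Fraction(1, 2), Fraction(2), Fraction(7, 3)]:
    closed_form = 1 + x / 2 + 3 * x * x / 2
    print(x, lagrange_interpolate(xs, ys, x), closed_form)
```

Each evaluation costs on the order of $n^2$ operations in this form, since every $l_i(x)$ is recomputed from scratch.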


If a function $f(x)$ is given, then we can form an approximation to it using the
interpolating polynomial

$$p_n(x; f) \equiv p_n(x) = \sum_{i=0}^{n} f(x_i) l_i(x) \tag{3.1.7}$$

This interpolates $f(x)$ at $x_0, \ldots, x_n$. For example, we later consider
$f(x) = \log_{10} x$ with linear interpolation. The basic result used in
analyzing the error of interpolation is the following theorem. As a notation,
$\mathcal{H}\{a, b, c, \ldots\}$ denotes the smallest interval containing all of
the real numbers $a, b, c, \ldots$.


Theorem 3.2  Let $x_0, x_1, \ldots, x_n$ be distinct real numbers, and let $f$ be
             a given real valued function with $n + 1$ continuous derivatives on
             the interval $I_t = \mathcal{H}\{t, x_0, \ldots, x_n\}$, with $t$ some
             given real number. Then there exists $\xi \in I_t$ with

$$f(t) - \sum_{j=0}^{n} f(x_j) l_j(t) = \frac{(t - x_0) \cdots (t - x_n)}{(n + 1)!} f^{(n+1)}(\xi) \tag{3.1.8}$$


Proof  Note that the result is trivially true if $t$ is any node point, since
       then both sides of (3.1.8) are zero. Assume $t$ does not equal any node
       point. Then define

$$E(x) = f(x) - p_n(x), \qquad p_n(x) = \sum_{j=0}^{n} f(x_j) l_j(x)$$

$$G(x) = E(x) - \frac{\Psi(x)}{\Psi(t)} E(t) \qquad \text{for all } x \in I_t \tag{3.1.9}$$

       with

$$\Psi(x) = (x - x_0) \cdots (x - x_n)$$

       The function $G(x)$ is $n + 1$ times continuously differentiable on the
       interval $I_t$, as are $E(x)$ and $\Psi(x)$. Also,

$$G(x_i) = E(x_i) - \frac{\Psi(x_i)}{\Psi(t)} E(t) = 0, \qquad i = 0, 1, \ldots, n$$

$$G(t) = E(t) - E(t) = 0$$

       Thus $G$ has $n + 2$ distinct zeros in $I_t$. Using the mean value
       theorem, $G'$ has $n + 1$ distinct zeros. Inductively, $G^{(j)}(x)$ has
       $n + 2 - j$ zeros in $I_t$, for $j = 0, 1, \ldots, n + 1$. Let $\xi$ be a
       zero of $G^{(n+1)}(x)$,

$$G^{(n+1)}(\xi) = 0$$

       Since

$$E^{(n+1)}(x) = f^{(n+1)}(x), \qquad \Psi^{(n+1)}(x) = (n + 1)!$$

       we obtain

$$G^{(n+1)}(x) = f^{(n+1)}(x) - \frac{(n + 1)!}{\Psi(t)} E(t)$$

       Substituting $x = \xi$ and solving for $E(t)$,

$$E(t) = \frac{\Psi(t)}{(n + 1)!} f^{(n+1)}(\xi)$$

       the desired result.

       This may seem a "tricky" derivation, but it is a commonly used
       technique for obtaining some error formulas.  ∎


Example  For $n = 1$, using $x$ in place of $t$,

$$f(x) - \frac{(x_1 - x) f(x_0) + (x - x_0) f(x_1)}{x_1 - x_0} = \frac{(x - x_0)(x - x_1)}{2} f''(\xi_x) \tag{3.1.10}$$

for some $\xi_x \in \mathcal{H}\{x_0, x_1, x\}$. The subscript $x$ on $\xi_x$
shows explicitly that $\xi$ depends on $x$; usually we omit the subscript, for
convenience.

We now apply the n = 1 case to the common high school technique of linear
interpolation in a logarithm table. Let

$$f(x) = \log_{10} x$$

Then $f''(x) = -\log_{10} e / x^2$, $\log_{10} e \approx 0.434$. In a table, we
generally would have $x_0 < x < x_1$. Then

$$E(x) = \frac{(x - x_0)(x_1 - x)}{2} \cdot \frac{\log_{10} e}{\xi^2} \qquad x_0 \le \xi \le x_1$$

This gives the upper and lower bounds

$$\frac{\log_{10} e}{x_1^2} \cdot \frac{(x - x_0)(x_1 - x)}{2} \le E(x) \le \frac{\log_{10} e}{x_0^2} \cdot \frac{(x - x_0)(x_1 - x)}{2}$$

This shows that the error function E(x) looks very much like a quadratic
polynomial, especially if the distance $h = x_1 - x_0$ is reasonably small. For a
uniform bound on $[x_0, x_1]$,

$$\max_{x_0 \le x \le x_1} (x_1 - x)(x - x_0) = \frac{h^2}{4}$$

$$|\log_{10} x - p_1(x)| \le \frac{h^2}{8} \cdot \frac{.434}{x_0^2} = \frac{.0542 h^2}{x_0^2} \le .0542 h^2 \qquad (3.1.11)$$

for $x_0 \ge 1$, as is usual in a logarithm table. Note that the interpolation error in a
standard table is much less for x near 10 than near 1. Also, the maximum error is
near the midpoint of $[x_0, x_1]$.

For a four-place table, h = .01,

$$|\log_{10} x - p_1(x)| \le 5.42 \times 10^{-6} \qquad 1 \le x_0 < x_1 \le 10$$


Since the entries in the table are given to four digits (e.g., $\log_{10} 2 \approx .3010$), this
result is sufficiently accurate. Why do we need a more accurate five-place table if
the preceding is so accurate? Because we have neglected to include the effects of
the rounding errors present in the table entries. For example, with $\log_{10} 2 \approx .3010$,

$$|\log_{10} 2 - .3010| \le .00005$$

and this will dominate the interpolation error if $x_0$ or $x_1 = 2$.


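Rounding error analysis for linear interpolation Let

$$f(x_0) = f_0 + \epsilon_0 \qquad f(x_1) = f_1 + \epsilon_1$$

with $f_0$ and $f_1$ the table entries and $\epsilon_0, \epsilon_1$ the rounding errors. We will assume

$$|\epsilon_0|, |\epsilon_1| \le \epsilon$$

for a known $\epsilon$. In the case of the four-place logarithm table, $\epsilon = .00005$.
We want to bound

$$\mathscr{E}(x) = f(x) - \frac{(x_1 - x) f_0 + (x - x_0) f_1}{x_1 - x_0} \qquad (3.1.12)$$

Using $f_i = f(x_i) - \epsilon_i$,

$$\mathscr{E}(x) = f(x) - \frac{(x_1 - x) f(x_0) + (x - x_0) f(x_1)}{x_1 - x_0} + \frac{(x_1 - x)\epsilon_0 + (x - x_0)\epsilon_1}{x_1 - x_0} = E(x) + R(x) \qquad (3.1.13)$$

$$E(x) = \frac{(x - x_0)(x - x_1)}{2}\, f''(\xi) \qquad \xi \in [x_0, x_1]$$

The error $\mathscr{E}(x)$ is the sum of the theoretical interpolation error E(x) and the
function R(x), which depends on $\epsilon_0, \epsilon_1$. Since R(x) is a straight line, its
maximum on $[x_0, x_1]$ is attained at an endpoint,

$$\max_{x_0 \le x \le x_1} |R(x)| = \max\{|\epsilon_0|, |\epsilon_1|\} \le \epsilon \qquad (3.1.14)$$

With $x_1 = x_0 + h$, $x_0 \le x \le x_1$,

$$|\mathscr{E}(x)| \le \frac{h^2}{8} \max_{x_0 \le t \le x_1} |f''(t)| + \max\{|\epsilon_0|, |\epsilon_1|\} \qquad (3.1.15)$$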


Example For the earlier logarithm example using a four-place table,

$$|\mathscr{E}(x)| \le 5.42 \times 10^{-6} + 5 \times 10^{-5} \approx 5.5 \times 10^{-5}$$

For a five-place table, h = .001, $\epsilon = .000005$, and

$$|\mathscr{E}(x)| \le 5.42 \times 10^{-8} + 5 \times 10^{-6} \approx 5.05 \times 10^{-6} \qquad x_0 \le x \le x_1$$

The rounding error is the only significant error in using linear interpolation in a
five-place logarithm table. In fact, it would seem worthwhile to increase the
five-place table to a six-place table, without changing the mesh size h. Then we
would have a maximum error for $\mathscr{E}(x)$ of $5.5 \times 10^{-7}$, without any significant
increase in computation. These arguments on rounding error generalize to higher
degree polynomial interpolation, although the result on Max |R(x)| is slightly
more complicated (see Problem 8).
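The two contributions in (3.1.15) can be tabulated directly. The following is a minimal Python sketch, not part of the original text: the function names `error_bound` and `measured_interp_error` are illustrative assumptions, and the full value of $\log_{10} e$ is used rather than the rounded .434, so the printed bounds differ slightly in the third digit from those quoted above.

```python
import math

LOG10_E = math.log10(math.e)  # about 0.4343; the text rounds this to .434


def error_bound(h, eps, x0=1.0):
    """Return (interpolation bound, rounding bound, total) from (3.1.15)
    for f(x) = log10(x) on [x0, x0 + h]."""
    # |f''(t)| = log10(e) / t**2 is largest at the left endpoint t = x0.
    interp = h * h / 8.0 * LOG10_E / (x0 * x0)
    return interp, eps, interp + eps


def measured_interp_error(x0, h, samples=1001):
    """Largest |log10(x) - p_1(x)| on a grid, using exact endpoint values
    (that is, with no rounding error in the table entries)."""
    f0, f1 = math.log10(x0), math.log10(x0 + h)
    worst = 0.0
    for i in range(samples):
        x = x0 + h * i / (samples - 1)
        p1 = ((x0 + h - x) * f0 + (x - x0) * f1) / h
        worst = max(worst, abs(math.log10(x) - p1))
    return worst


if __name__ == "__main__":
    # (places, mesh size h, rounding bound eps); the six-place row keeps h = .001.
    for places, h, eps in [(4, 0.01, 5e-5), (5, 0.001, 5e-6), (6, 0.001, 5e-7)]:
        interp, rounding, total = error_bound(h, eps)
        print(f"{places}-place: interp <= {interp:.3g}, "
              f"rounding <= {rounding:.3g}, total <= {total:.3g}")
```

The measured error on [1, 1.01] stays just below the theoretical interpolation bound, which is consistent with the remark that E(x) behaves like a quadratic with its maximum near the midpoint.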


None of the results of this section take into account new rounding errors that
occur in the evaluation of $p_n(x)$. These are minimized by results given in the next
section.


3.2 Newton Divided Differences 


The Lagrange form of the interpolation polynomial can be used for interpolation
to a function given in tabular form; tables in Abramowitz and Stegun (1964,
chap. 25) can be used to evaluate the functions $l_j(x)$ more easily. But there are
other forms that are much more convenient, and they are developed in this and
the following section. With the Lagrange form, it is inconvenient to pass from
one interpolation polynomial to another of degree one greater. Such a comparison
of different degree interpolation polynomials is a useful technique in deciding
what degree polynomial to use. The formulas developed in this section are for
nonevenly spaced grid points $\{x_i\}$. As such they are convenient for inverse
interpolation in a table, a point we illustrate later. These formulas are specialized
in Section 3.3 to the case of evenly spaced grid points.
We would like to write 


$$p_n(x) = p_{n-1}(x) + C(x) \qquad C(x) = \text{correction term} \qquad (3.2.1)$$

Then, in general, C(x) is a polynomial of degree n, since usually degree $(p_{n-1}) = n - 1$
and degree $(p_n) = n$. Also we have

$$C(x_i) = p_n(x_i) - p_{n-1}(x_i) = f(x_i) - f(x_i) = 0 \qquad i = 0, \ldots, n - 1$$

Thus

$$C(x) = a_n (x - x_0) \cdots (x - x_{n-1})$$

Since $p_n(x_n) = f(x_n)$, we have from (3.2.1) that

$$a_n = \frac{f(x_n) - p_{n-1}(x_n)}{(x_n - x_0) \cdots (x_n - x_{n-1})}$$

For reasons derived below, this coefficient $a_n$ is called the nth-order Newton
divided difference of f, and it is denoted by

$$a_n = f[x_0, x_1, \ldots, x_n]$$

Thus our interpolation formula becomes

$$p_n(x) = p_{n-1}(x) + (x - x_0) \cdots (x - x_{n-1})\, f[x_0, \ldots, x_n] \qquad (3.2.2)$$


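To obtain more information on $a_n$, we return to the Lagrange formula (3.1.7)
for $p_n(x)$. Write

$$\Psi_n(x) = (x - x_0) \cdots (x - x_n) \qquad (3.2.3)$$

Then

$$\Psi_n'(x_j) = (x_j - x_0) \cdots (x_j - x_{j-1})(x_j - x_{j+1}) \cdots (x_j - x_n)$$

and if x is not a node point

$$p_n(x) = \sum_{j=0}^{n} \frac{\Psi_n(x)}{(x - x_j)\, \Psi_n'(x_j)}\, f(x_j) \qquad (3.2.4)$$

Since $a_n$ is the coefficient of $x^n$ in $p_n(x)$, we use the Lagrange formula to obtain
the coefficient of $x^n$. By looking at each nth-degree term in the formula (3.2.4),
we obtain

$$f[x_0, x_1, \ldots, x_n] = \sum_{j=0}^{n} \frac{f(x_j)}{\Psi_n'(x_j)} \qquad (3.2.5)$$

From this formula, we obtain an important property of the divided difference.
Let $(i_0, i_1, \ldots, i_n)$ be some permutation of (0, 1, ..., n). Then easily

$$\sum_{j=0}^{n} \frac{f(x_j)}{\Psi_n'(x_j)} = \sum_{j=0}^{n} \frac{f(x_{i_j})}{\Psi_n'(x_{i_j})}$$

since the second sum is merely a rearrangement of the first one. But then

$$f[x_{i_0}, x_{i_1}, \ldots, x_{i_n}] = f[x_0, x_1, \ldots, x_n] \qquad (3.2.6)$$

for any permutation $(i_0, \ldots, i_n)$ of (0, 1, ..., n).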
Another useful formula for computing $f[x_0, \ldots, x_n]$ is

$$f[x_0, x_1, \ldots, x_n] = \frac{f[x_1, \ldots, x_n] - f[x_0, \ldots, x_{n-1}]}{x_n - x_0} \qquad (3.2.7)$$

which also explains the name of divided difference. This result can be proved
from (3.2.5) or from the following alternative formula for $p_n(x)$:

$$p_n(x) = \frac{(x - x_0)\, p_{n-1}^{(1,n)}(x) - (x - x_n)\, p_{n-1}^{(0,n-1)}(x)}{x_n - x_0} \qquad (3.2.8)$$

with $p_{n-1}^{(0,n-1)}(x)$ the polynomial of degree $\le n - 1$ interpolating f(x) at
$\{x_0, \ldots, x_{n-1}\}$ and $p_{n-1}^{(1,n)}(x)$ the polynomial interpolating f(x) at $\{x_1, \ldots, x_n\}$.
The proofs of (3.2.7) and (3.2.8) appear in Problem 13.

Table 3.1 Format for constructing divided differences of f(x)

x_i    f(x_i)    f[x_i, x_{i+1}]    f[x_i, x_{i+1}, x_{i+2}]    ...

x_0    f_0
                 f[x_0, x_1]
x_1    f_1                          f[x_0, x_1, x_2]            ...
                 f[x_1, x_2]
x_2    f_2                          f[x_1, x_2, x_3]            ...
                 f[x_2, x_3]
x_3    f_3                          f[x_2, x_3, x_4]            ...
                 f[x_3, x_4]
x_4    f_4                          f[x_3, x_4, x_5]            ...
                 f[x_4, x_5]
x_5    f_5

Returning to the formula (3.2.2), we have the formulas 


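$$p_0(x) = f(x_0)$$

$$p_1(x) = f(x_0) + (x - x_0)\, f[x_0, x_1]$$

$$p_2(x) = f(x_0) + (x - x_0)\, f[x_0, x_1] + (x - x_0)(x - x_1)\, f[x_0, x_1, x_2]$$

$$\vdots$$

$$p_n(x) = f(x_0) + (x - x_0)\, f[x_0, x_1] + \cdots + (x - x_0) \cdots (x - x_{n-1})\, f[x_0, \ldots, x_n] \qquad (3.2.9)$$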


This is called Newton's divided difference formula for the interpolation polynomial.
It is much better for computation than the Lagrange formula (3.1.7), although there
are variants of the Lagrange formula that are more efficient than the basic form
(3.1.7).

To construct the divided differences, use the format shown in Table 3.1. Each 
numerator of a difference is obtained by differencing the two adjacent entries in 
the column to the left of the column you are constructing. 


Example We construct a divided difference table for $f(x) = \sqrt{x}$, shown in
Table 3.2. We have used the notation $D^r f(x_i) = f[x_i, x_{i+1}, \ldots, x_{i+r}]$. Note that
the table entries for $f(x_i)$ have rounding errors in the seventh place, and that this
affects the accuracy of the resulting divided differences. A discussion of the
effects of rounding error in evaluating $p_n(x)$ in both its Lagrange and Newton
finite difference form is given in Powell (1981, p. 51).

Table 3.2 Example of constructing divided differences

x_i    f(x_i)      f[x_i, x_{i+1}]    D^2 f[x_i]    D^3 f[x_i]    D^4 f[x_i]

2.0    1.414214
                   .34924
2.1    1.449138                       -.04110
                   .34102                           .009167
2.2    1.483240                       -.03835                     -.002084
                   .33335                           .008333
2.3    1.516575                       -.03585
                   .32618
2.4    1.549193


A simple algorithm can be given for constructing the differences

$$f(x_0),\ f[x_0, x_1],\ f[x_0, x_1, x_2],\ \ldots,\ f[x_0, \ldots, x_n]$$

which are necessary for evaluating the Newton form (3.2.9).


Algorithm Divdif (d, x, n)

1. Remark: d and x are vectors with entries $f(x_i)$ and $x_i$,
   i = 0, 1, ..., n, respectively. On exit, $d_i$ will contain
   $f[x_0, \ldots, x_i]$.

2. Do through step 4 for i = 1, 2, ..., n.

3. Do through step 4 for j = n, n - 1, ..., i.

4. $d_j := (d_j - d_{j-1})/(x_j - x_{j-i})$.

5. Exit from the algorithm.


To evaluate the Newton form of the interpolating polynomial (3.2.9), we give a 
simple variant of the nested polynomial multiplication (2.9.8) of Chapter 2. 


Algorithm Interp (d, x, n, t, p)

1. Remark: On entrance, d and x are vectors containing
   $f[x_0, \ldots, x_i]$ and $x_i$, i = 0, 1, ..., n, respectively. On exit, p
   will contain the value $p_n(t)$ of the nth-degree polynomial
   interpolating f on x.

2. $p := d_n$

3. Do through step 4 for i = n - 1, n - 2, ..., 0.

4. $p := d_i + (t - x_i)\, p$

5. Exit the algorithm.
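The two algorithms translate almost line for line into a modern language. The following Python sketch is one possible rendering, not the author's code: the function names `divdif` and `interp` mirror the algorithm names, but returning a new list instead of overwriting d, and the module-level sample data `X` and `F` (taken from the $\sqrt{x}$ table of the preceding example), are choices made here for illustration.

```python
def divdif(x, f):
    """Algorithm Divdif: return d with d[i] = f[x_0, ..., x_i].

    x holds the distinct nodes and f the values f(x_i).
    """
    d = list(f)              # work on a copy; the input values are left untouched
    n = len(x) - 1
    for i in range(1, n + 1):
        # Sweep j downward so that d[j - 1] still holds the order i - 1 difference.
        for j in range(n, i - 1, -1):
            d[j] = (d[j] - d[j - 1]) / (x[j] - x[j - i])
    return d


def interp(d, x, n, t):
    """Algorithm Interp: evaluate the Newton form (3.2.9) of p_n at t by
    nested multiplication, using d[0..n] and the nodes x[0..n-1]."""
    p = d[n]
    for i in range(n - 1, -1, -1):
        p = d[i] + (t - x[i]) * p
    return p


# Sample data: the square root values used in the divided difference example.
X = [2.0, 2.1, 2.2, 2.3, 2.4]
F = [1.414214, 1.449138, 1.483240, 1.516575, 1.549193]

if __name__ == "__main__":
    d = divdif(X, F)
    print("divided differences:", [round(v, 6) for v in d])
    for n in range(1, 5):
        print(n, round(interp(d, X, n, 2.15), 6))
```

Because d[0..n] contains the leading divided differences of every order, the same vector serves for all degrees $n' \le n$; only the starting index of the nested multiplication changes.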


Example For $f(x) = \sqrt{x}$, we give in Table 3.3 the values of $p_n(x)$ for various
values of x and n. The highest degree polynomial $p_4(x)$ uses function values at
the grid points $x_0 = 2.0$ through $x_4 = 2.4$. The necessary divided differences are
given in the last example, Table 3.2.


When a value of x falls outside $\mathscr{H}\{x_0, x_1, \ldots, x_n\}$, we often say that $p_n(x)$
extrapolates f(x). In the last example, note the greater inaccuracy of the
extrapolated value $p_n(2.45)$ as compared with $p_n(2.05)$ and $p_n(2.15)$. In this text,
however, the word interpolation always includes the possibility that x falls
outside the interval $\mathscr{H}\{x_0, \ldots, x_n\}$.

Often we know the value of the function f(x), and we want to compute the
corresponding value of x. This is called inverse interpolation. It is commonly
known to users of logarithm tables as computing the antilog of a number. To
compute x, we treat it as the dependent variable and y = f(x) as the independent
variable. Given table values $(x_i, y_i)$, i = 0, ..., n, we produce a polynomial
$p_n(y)$ that interpolates $x_i$ at $y_i$, i = 0, ..., n. In effect, we are interpolating the
inverse function $g(y) = f^{-1}(y)$; in the error formula of Theorem 3.2, with
$x = f^{-1}(y)$,

$$x - p_n(y) = \frac{(y - y_0) \cdots (y - y_n)}{(n+1)!}\, g^{(n+1)}(\xi) \qquad (3.2.10)$$

for some $\xi \in \mathscr{H}\{y, y_0, y_1, \ldots, y_n\}$. If they are needed, the derivatives of g(y)
can be computed by differentiating the composite formula

$$g(f(x)) = x$$

for example,

$$g'(y) = \frac{1}{f'(x)} \qquad \text{for } y = f(x)$$


Table 3.3 Example of use of Newton's formula (3.2.9)

x       p_1(x)      p_2(x)      p_3(x)      p_4(x)      sqrt(x)
2.05    1.431676    1.431779    1.431782    1.431782    1.431782
2.15    1.466600    1.466292    1.466288    1.466288    1.466288
2.45    1.571372    1.564899    1.565260    1.565247    1.565248


Example Consider the Table 3.4 of values of the Bessel function $J_0(x)$, taken
from Abramowitz and Stegun (1964, chap. 9). We will calculate the value of x for
which $J_0(x) = 0.1$. Table 3.5 gives values of $p_n(y)$ for n = 0, 1, ..., 6, with
$x_0 = 2.0$. The polynomial $p_n(y)$ is interpolating the inverse function of $J_0(x)$,
call it g(y). The answer x = 2.2186838 is correct to eight significant digits.

Table 3.4 Values of Bessel function $J_0(x)$

x      J_0(x)
2.0     .2238907791
2.1     .1666069803
2.2     .1103622669
2.3     .0555397844
2.4     .0025076832
2.5    -.0483837764
2.6    -.0968049544
2.7    -.1424493700
2.8    -.1850360334
2.9    -.2243115458

Table 3.5 Example of inverse interpolation

n    p_n(y)         p_n(y) - p_{n-1}(y)    g[y_0, ..., y_n]
0    2.0                                    2.0
1    2.216275425     2.16E-1               -1.745694282
2    2.218619608     2.34E-3                .2840748405
3    2.218686252     6.66E-5               -.7793711812
4    2.218683344    -2.91E-6                .7648986704
5    2.218683964     6.20E-7               -1.672357264
6    2.218683773    -1.91E-7                3.477333126
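Inverse interpolation uses exactly the same divided difference machinery, with the roles of the two table columns exchanged. The following Python sketch is an illustration under stated assumptions: the function name `inverse_interpolate` and the module-level lists `X` and `J0` are introduced here, and the data are the first seven entries of the table of $J_0(x)$ values.

```python
# Nodes x_0, ..., x_6 and the corresponding tabulated values y_i = J_0(x_i).
X = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6]
J0 = [0.2238907791, 0.1666069803, 0.1103622669, 0.0555397844,
      0.0025076832, -0.0483837764, -0.0968049544]


def inverse_interpolate(xs, ys, y_target):
    """Return [p_0(y), p_1(y), ..., p_n(y)] at y = y_target, where p_k
    interpolates the inverse function g at y_0, ..., y_k (so g(y_i) = x_i)."""
    n = len(xs) - 1
    g = list(xs)                       # g[j] becomes g[y_0, ..., y_j]
    for i in range(1, n + 1):
        for j in range(n, i - 1, -1):
            g[j] = (g[j] - g[j - 1]) / (ys[j] - ys[j - i])
    estimates = []
    p, w = 0.0, 1.0                    # w = (y - y_0) ... (y - y_{k-1})
    for k in range(n + 1):
        p += w * g[k]                  # add the k-th Newton correction term
        estimates.append(p)
        w *= y_target - ys[k]
    return estimates


if __name__ == "__main__":
    for n, value in enumerate(inverse_interpolate(X, J0, 0.1)):
        print(n, f"{value:.9f}")
```

Accumulating the Newton terms one at a time, rather than using nested multiplication, makes every intermediate $p_k(y)$ available, so the size of successive corrections can be watched as the degree increases.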
An interpolation error formula using divided differences Let t be a real number,
distinct from the node points $x_0, x_1, \ldots, x_n$. Construct the polynomial
interpolating to f(x) at $x_0, \ldots, x_n$, and t:

$$p_{n+1}(x) = f(x_0) + (x - x_0)\, f[x_0, x_1] + \cdots + (x - x_0) \cdots (x - x_{n-1})\, f[x_0, \ldots, x_n] + (x - x_0) \cdots (x - x_n)\, f[x_0, x_1, \ldots, x_n, t]$$

$$= p_n(x) + (x - x_0) \cdots (x - x_n)\, f[x_0, \ldots, x_n, t]$$

Since $p_{n+1}(t) = f(t)$, let x = t to obtain

$$f(t) - p_n(t) = (t - x_0) \cdots (t - x_n)\, f[x_0, \ldots, x_n, t] \qquad (3.2.11)$$


where $p_n(t)$ was moved to the left side of the equation. This gives us another
formula for the error $f(t) - p_n(t)$, one that is very useful in many situations.
Comparing this with the earlier error formula (3.1.8) and canceling the multiplying
polynomial $\Psi_n(t)$, we have

$$f[x_0, x_1, \ldots, x_n, t] = \frac{f^{(n+1)}(\xi)}{(n+1)!}$$

for some $\xi \in \mathscr{H}\{x_0, x_1, \ldots, x_n, t\}$. To make this result symmetric in the
arguments, we generally let $t = x_{n+1}$, n = m - 1, and obtain

$$f[x_0, x_1, \ldots, x_m] = \frac{f^{(m)}(\xi)}{m!} \qquad \text{some } \xi \in \mathscr{H}\{x_0, \ldots, x_m\} \qquad (3.2.12)$$

With this result, the Newton formula (3.2.9) looks like a truncated Taylor series
for f(x), expanded about $x_0$, provided the size $x_n - x_0$ is not too large.


Example From Table 3.2, of divided differences for $f(x) = \sqrt{x}$,

$$f[2.0, 2.1, \ldots, 2.4] = -.002084$$

Since $f^{(4)}(x) = -15/(16 x^3 \sqrt{x})$, it is easy to show that

$$\frac{f^{(4)}(2.3103)}{4!} \approx -.002084$$

so $\xi \approx 2.31$ in (3.2.12) for this case.
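The value of $\xi$ in this example can be recovered by solving $f^{(4)}(\xi)/4! = f[2.0, \ldots, 2.4]$ in closed form, since $f^{(4)}(x) = -(15/16)\, x^{-7/2}$. A small Python check follows; the function names are assumptions made for illustration.

```python
import math


def fourth_derivative_sqrt(x):
    """f''''(x) for f(x) = sqrt(x), namely -15 / (16 x^3 sqrt(x))."""
    return -15.0 / (16.0 * x ** 3 * math.sqrt(x))


def xi_from_divided_difference(dd):
    """Solve f''''(xi) / 4! = dd for xi, with f(x) = sqrt(x) and dd < 0.

    From -15 / (16 * 24 * xi**3.5) = dd we get xi = (-15 / (384 dd))**(2/7).
    """
    return (-15.0 / (16.0 * math.factorial(4) * dd)) ** (2.0 / 7.0)


if __name__ == "__main__":
    xi = xi_from_divided_difference(-0.002084)
    print(f"xi = {xi:.4f}")
    print(f"check: f''''(xi)/4! = {fourth_derivative_sqrt(xi) / 24.0:.6f}")
```

The computed $\xi$ lies inside [2.0, 2.4], as (3.2.12) requires.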


Example If the formula (3.2.12) is used to estimate the derivatives of the inverse
function g(y) of $J_0(x)$, given in a previous example, then the derivatives $g^{(n)}(y)$
are growing rapidly with n. For example,

$$g[y_0, \ldots, y_6] \approx 3.48$$

$$g^{(6)}(\xi) \approx (6!)(3.48) \approx 2500$$

for some $\xi$ in [-.0968, .2239]. Similar estimates can be computed for the other
derivatives.


To extend the definition of Newton divided difference to the case in which
some or all of the nodes coincide, we introduce another formula for the divided
difference.


Theorem 3.3 (Hermite-Genocchi) Let $x_0, x_1, \ldots, x_n$ be distinct, and let
f(x) be n times continuously differentiable on the interval
$\mathscr{H}\{x_0, x_1, \ldots, x_n\}$. Then

$$f[x_0, x_1, \ldots, x_n] = \int \cdots \int_{\tau_n} f^{(n)}(t_0 x_0 + \cdots + t_n x_n)\, dt_1 \cdots dt_n \qquad (3.2.13)$$

in which

$$\tau_n = \Big\{ (t_1, \ldots, t_n) \,\Big|\, \text{all } t_i \ge 0,\ \sum_{i=1}^{n} t_i \le 1 \Big\} \qquad t_0 = 1 - \sum_{i=1}^{n} t_i \qquad (3.2.14)$$

Note that $t_0 \ge 0$ and $\sum_{i=0}^{n} t_i = 1$.


Proof We show that (3.2.13) is true for n = 1 and 2, and these two cases should
suggest the general induction proof.

1. n = 1. Then $\tau_1 = [0, 1]$.

$$\int_0^1 f'(t_0 x_0 + t_1 x_1)\, dt_1 = \int_0^1 f'(x_0 + t_1 (x_1 - x_0))\, dt_1$$

$$= \frac{1}{x_1 - x_0} \Big[ f(x_0 + t_1 (x_1 - x_0)) \Big]_{t_1 = 0}^{t_1 = 1}$$

$$= \frac{f(x_1) - f(x_0)}{x_1 - x_0} = f[x_0, x_1]$$

2. n = 2. Then $\tau_2$ is the triangle with vertices (0, 0), (0, 1), and (1, 0),
shown in Figure 3.1.

$$\iint_{\tau_2} f''(t_0 x_0 + t_1 x_1 + t_2 x_2)\, dt_1\, dt_2$$

$$= \int_0^1 \int_0^{1 - t_1} f''(x_0 + t_1 (x_1 - x_0) + t_2 (x_2 - x_0))\, dt_2\, dt_1$$

$$= \int_0^1 \frac{1}{x_2 - x_0} \Big[ f'(x_0 + t_1 (x_1 - x_0) + t_2 (x_2 - x_0)) \Big]_{t_2 = 0}^{t_2 = 1 - t_1} dt_1$$

$$= \frac{1}{x_2 - x_0} \left[ \int_0^1 f'(x_2 + t_1 (x_1 - x_2))\, dt_1 - \int_0^1 f'(x_0 + t_1 (x_1 - x_0))\, dt_1 \right]$$

$$= \frac{1}{x_2 - x_0} \big\{ f[x_1, x_2] - f[x_0, x_1] \big\} = f[x_0, x_1, x_2]$$

Figure 3.1 Region $\tau_2$.

Do the general case by a similar procedure. Integrate once and
reduce to one lower dimension. Then invoke the induction hypothesis
and use (3.2.7) to complete the proof. ∎


We can now look at $f[x_0, x_1, \ldots, x_n]$ using (3.2.13). Doing so, we see that if
f(x) is n times continuously differentiable on $\mathscr{H}\{x_0, \ldots, x_n\}$, then $f[x_0, \ldots, x_n]$
is a continuous function of the n + 1 variables $x_0, x_1, \ldots, x_n$, regardless of whether
they are distinct or not. For example, if we let all points coalesce to $x_0$, then for
the nth-order divided difference,

$$f[x_0, \ldots, x_0] = \int \cdots \int_{\tau_n} f^{(n)}(x_0)\, dt_1 \cdots dt_n = f^{(n)}(x_0) \cdot \mathrm{Vol}(\tau_n)$$

From Problem 15, $\mathrm{Vol}(\tau_n) = 1/n!$, and thus

$$f[x_0, \ldots, x_0] = \frac{f^{(n)}(x_0)}{n!} \qquad (3.2.15)$$


This could have been predicted directly from (3.2.12). But if only some of the
nodes coalesce, we must use (3.2.13) to justify the existence of $f[x_0, \ldots, x_n]$.
In applications to numerical integration, we need to know whether

$$\frac{d}{dx} f[x_0, \ldots, x_n, x] \qquad (3.2.16)$$

exists. If f is n + 2 times continuously differentiable, then we can apply Theorem
3.3. By applying theorems on differentiating an integral with respect to a
parameter in the integrand, we can conclude the existence of (3.2.16). More
directly,

$$\frac{d}{dx} f[x_0, \ldots, x_n, x] = \lim_{h \to 0} \frac{f[x_0, \ldots, x_n, x + h] - f[x_0, \ldots, x_n, x]}{h}$$

$$= \lim_{h \to 0} \frac{f[x_0, \ldots, x_n, x + h] - f[x, x_0, \ldots, x_n]}{h}$$

$$= \lim_{h \to 0} f[x, x_0, \ldots, x_n, x + h]$$

$$= f[x, x_0, \ldots, x_n, x]$$

$$\frac{d}{dx} f[x_0, \ldots, x_n, x] = f[x_0, x_1, \ldots, x_n, x, x] \qquad (3.2.17)$$

The existence and continuity of the right-hand side is guaranteed using (3.2.13).

There is a rich theory involving polynomial interpolation and divided differences,
but we conclude at this point with one final straightforward result. If f(x)
is a polynomial of degree m, then

$$f[x_0, \ldots, x_n, x] = \begin{cases} \text{polynomial of degree } m - n - 1 & n < m - 1 \\ a_m & n = m - 1 \\ 0 & n > m - 1 \end{cases} \qquad (3.2.18)$$

where $f(x) = a_m x^m + \text{lower degree terms}$. For the proof, see Problem 14.


3.3 Finite Differences and Table-Oriented 
Interpolation Formulas 


In this section, we introduce forward and backward differences, along with
interpolation formulas that use them. These differences are referred to collectively
as finite differences, and they are used to produce interpolation formulas
for tables in which the abscissae $\{x_i\}$ are evenly spaced. Such interpolation
formulas are also used in the numerical solution of ordinary and partial differential
equations. In addition, finite differences can be used to determine the
maximum degree of interpolation polynomial that can be used safely, based on
the accuracy of the table entries. And finite differences can be used to detect
noise in data, when the noise is large with respect to the rounding errors or
uncertainty errors of physical measurement. This idea is developed in Section 3.4.
For a given h > 0, define

$$\Delta_h f(z) = f(z + h) - f(z)$$

Generally the h is understood from the context, and we write

$$\Delta f(z) = f(z + h) - f(z) \qquad (3.3.1)$$

This quantity is called the forward difference of f at z, and $\Delta$ is called the
forward difference operator. We will always be working with evenly spaced node
points $x_i = x_0 + ih$, i = 0, 1, 2, 3, .... Then we write

$$\Delta f(x_i) = f(x_{i+1}) - f(x_i)$$

or more concisely,

$$\Delta f_i = f_{i+1} - f_i \qquad f_i = f(x_i) \qquad (3.3.2)$$

For $r \ge 0$, define

$$\Delta^{r+1} f(z) = \Delta^r f(z + h) - \Delta^r f(z) \qquad (3.3.3)$$

with $\Delta^0 f(z) = f(z)$. The term $\Delta^r f(z)$ is the rth-order forward difference of f at
z. Forward differences are quite easy to compute, and examples are given later in
connection with an interpolation formula.

We first derive results for the forward difference operator by applying the
results on the Newton divided difference.


Lemma 1 For $k \ge 0$,

$$f[x_0, x_1, \ldots, x_k] = \frac{1}{k!\, h^k}\, \Delta^k f_0 \qquad (3.3.4)$$

Proof For k = 0, the result is trivially true. For k = 1,

$$f[x_0, x_1] = \frac{f_1 - f_0}{x_1 - x_0} = \frac{1}{h}\, \Delta f_0$$

which shows (3.3.4). Assume the result (3.3.4) is true for all forward
differences of order $k \le r$. Then for k = r + 1, using (3.2.7),

$$f[x_0, x_1, \ldots, x_{r+1}] = \frac{f[x_1, \ldots, x_{r+1}] - f[x_0, \ldots, x_r]}{x_{r+1} - x_0}$$

Applying the induction hypothesis, this equals

$$\frac{1}{(r+1) h} \left[ \frac{1}{r!\, h^r}\, \Delta^r f_1 - \frac{1}{r!\, h^r}\, \Delta^r f_0 \right] = \frac{1}{(r+1)!\, h^{r+1}}\, \Delta^{r+1} f_0 \qquad ∎$$

We now modify the Newton interpolation formula (3.2.9) to a formula
involving forward differences in place of divided differences. For a given value of
x at which we will evaluate the interpolating polynomial, define

$$\mu = \frac{x - x_0}{h}$$

to indicate the position of x relative to $x_0$. For example, $\mu = 1.6$ means x is 6/10
of the distance from $x_1$ to $x_2$. We need a formula for

$$(x - x_0) \cdots (x - x_k)$$

with respect to the variable $\mu$:

$$x - x_j = x_0 + \mu h - (x_0 + jh) = (\mu - j) h$$

$$(x - x_0) \cdots (x - x_k) = \mu (\mu - 1) \cdots (\mu - k)\, h^{k+1} \qquad (3.3.5)$$

Combining (3.3.4) and (3.3.5) with the divided difference interpolation formula
(3.2.9), we obtain

$$p_n(x) = f_0 + \mu h \cdot \frac{\Delta f_0}{h} + \mu (\mu - 1) h^2 \cdot \frac{\Delta^2 f_0}{2!\, h^2} + \cdots + \mu (\mu - 1) \cdots (\mu - n + 1) h^n \cdot \frac{\Delta^n f_0}{n!\, h^n}$$

Define the binomial coefficients,

$$\binom{\mu}{k} = \frac{\mu (\mu - 1) \cdots (\mu - k + 1)}{k!} \qquad k > 0 \qquad (3.3.6)$$

and $\binom{\mu}{0} = 1$. Then

$$p_n(x) = \sum_{j=0}^{n} \binom{\mu}{j} \Delta^j f_0 \qquad \mu = \frac{x - x_0}{h} \qquad (3.3.7)$$

This is the Newton forward difference form of the interpolating polynomial.


Table 3.6 Format for constructing forward differences

x_i    f_i    Δf_i     Δ²f_i     Δ³f_i

x_0    f_0
              Δf_0
x_1    f_1             Δ²f_0
              Δf_1               Δ³f_0
x_2    f_2             Δ²f_1
              Δf_2               Δ³f_1
x_3    f_3             Δ²f_2
              Δf_3               Δ³f_2
x_4    f_4             Δ²f_3
              Δf_4
x_5    f_5


Table 3.7 Forward difference table for f(x) = sqrt(x)

x_i    f_i         Δf_i       Δ²f_i       Δ³f_i      Δ⁴f_i

2.0    1.414214
                   .034924
2.1    1.449138               -.000822
                   .034102                 .000055
2.2    1.483240               -.000767               -.000005
                   .033335                 .000050
2.3    1.516575               -.000717
                   .032618
2.4    1.549193


Example For n = 1,

$$p_1(x) = f_0 + \mu\, \Delta f_0 \qquad (3.3.8)$$

This is the formula that most people use when doing linear interpolation in a
table.
For n = 2,

$$p_2(x) = f_0 + \mu\, \Delta f_0 + \frac{\mu (\mu - 1)}{2}\, \Delta^2 f_0 \qquad (3.3.9)$$

which is an easily computable form of the quadratic interpolating polynomial.


The forward differences are constructed in a pattern like that for divided
differences, but now there are no divisions (see Table 3.6).


Example The forward differences for $f(x) = \sqrt{x}$ are given in Table 3.7. The
values of the interpolating polynomial will be the same as those obtained using
the Newton divided difference formula, but the forward difference formula (3.3.7)
is much easier to compute.


Example Evaluate $p_n(x)$ using Table 3.7 for n = 1, 2, 3, 4, with x = 2.15. Note
that $\sqrt{2.15} \approx 1.4662878$, and $\mu = 1.5$.

$$p_1(x) = 1.414214 + 1.5(.034924) = 1.414214 + .052386 = 1.466600$$

$$p_2(x) = p_1(x) + \frac{(1.5)(.5)}{2}(-.000822) = 1.466600 - .00030825 = 1.466292$$

$$p_3(x) = p_2(x) + \frac{(1.5)(.5)(-.5)}{6}(.000055) = 1.466292 - .0000034 = 1.466288$$

$$p_4(x) = p_3(x) + \frac{(1.5)(.5)(-.5)(-1.5)}{24}(-.000005) = 1.466288 - .00000012 = 1.466288$$

The correction terms are easily computed; and by observing their size, you obtain
a generally accurate idea of when the degree n is sufficiently large. Note that the
seven-place accuracy in the table values of $\sqrt{x}$ leads to at most one place of
accuracy in the forward difference $\Delta^4 f_0$. The forward differences of order greater
than three are almost entirely the result of differencing the rounding errors in the
table entries; consequently, interpolation in this table should be limited to
polynomials of degree less than four. This idea is given further theoretical
justification in the next section.
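The forward difference formula (3.3.7) lends itself to a short program in which the binomial coefficient is updated recursively from one term to the next. The sketch below is illustrative only; the names `forward_differences` and `newton_forward`, and the list `F` holding the tabulated $\sqrt{x}$ values, are assumptions made here.

```python
def forward_differences(f):
    """Return [f_0, Δf_0, Δ²f_0, ...], the top diagonal of the difference table."""
    col = list(f)
    lead = [col[0]]
    while len(col) > 1:
        col = [b - a for a, b in zip(col, col[1:])]  # next column of the table
        lead.append(col[0])
    return lead


def newton_forward(f, x0, h, x, n):
    """Evaluate p_n(x) = sum_{j=0}^{n} C(mu, j) Δ^j f_0 with mu = (x - x0) / h."""
    mu = (x - x0) / h
    lead = forward_differences(f[: n + 1])
    total, coeff = 0.0, 1.0            # coeff holds the binomial coefficient C(mu, j)
    for j in range(n + 1):
        total += coeff * lead[j]
        coeff *= (mu - j) / (j + 1)    # C(mu, j + 1) = C(mu, j) (mu - j) / (j + 1)
    return total


# Tabulated values of sqrt(x) at x = 2.0, 2.1, ..., 2.4.
F = [1.414214, 1.449138, 1.483240, 1.516575, 1.549193]

if __name__ == "__main__":
    for n in range(1, 5):
        print(n, f"{newton_forward(F, 2.0, 0.1, 2.15, n):.6f}")
```

No divisions by node differences occur, which is the computational advantage over the divided difference form when the nodes are evenly spaced.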


There are other forms of differences and associated interpolation formulas.
Define the backward difference by

$$\nabla f(z) = f(z) - f(z - h)$$

$$\nabla^{r+1} f(z) = \nabla^r f(z) - \nabla^r f(z - h) \qquad r \ge 1 \qquad (3.3.10)$$

Completely analogous results to those for forward differences can be derived.
And we obtain the Newton backward difference interpolation formula,

$$p_n(x) = f_0 + \binom{-\nu}{1} \nabla f_0 + \binom{-\nu + 1}{2} \nabla^2 f_0 + \cdots + \binom{-\nu + n - 1}{n} \nabla^n f_0 \qquad (3.3.11)$$

In this formula, the interpolation nodes are $x_0, x_{-1}, x_{-2}, \ldots, x_{-n}$, with
$x_{-j} = x_0 - jh$, as before. The value $\nu$ is given by

$$\nu = \frac{x_0 - x}{h}$$

reflecting the fact that x will generally be less than $x_0$ when using this formula.
A backward difference diagram can be constructed in an analogous way to that
for forward differences. The backward difference formula is used in Chapter 6 to
develop the Adams family of formulas (named after John Couch Adams, a
nineteenth-century astronomer) for the numerical solution of differential equations.

Other difference formulas and associated interpolation formulas can be given.
Since they are used much less than the preceding formula, we just refer the reader
to Hildebrand (1956).


3.4 Errors in Data and Forward Differences 
We can use a forward difference table to detect noise in physical data, as long as
the noise is large relative to the usual limits of experimental error. We must begin
with some preliminary lemmas.


Lemma 2 $\Delta^r f(x_i) = h^r f^{(r)}(\xi_i)$, for some $x_i \le \xi_i \le x_{i+r}$.

Proof

$$\Delta^r f_i = h^r r!\, f[x_i, \ldots, x_{i+r}] = h^r r! \cdot \frac{f^{(r)}(\xi_i)}{r!} = h^r f^{(r)}(\xi_i)$$

using Lemma 1 and (3.2.12). ∎


Lemma 3 For any two functions f and g, and for any two constants $\alpha$ and $\beta$,

$$\Delta^r (\alpha f(x) + \beta g(x)) = \alpha \Delta^r f(x) + \beta \Delta^r g(x) \qquad r \ge 0$$

Proof The result is trivial if r = 0 or r = 1. Assume the result is true for all
$r \le n$, and prove it for r = n + 1:

$$\Delta^{n+1} [\alpha f(x) + \beta g(x)] = \Delta^n [\alpha f(x + h) + \beta g(x + h)] - \Delta^n [\alpha f(x) + \beta g(x)]$$

$$= \alpha \Delta^n f(x + h) + \beta \Delta^n g(x + h) - \alpha \Delta^n f(x) - \beta \Delta^n g(x)$$

using the definition (3.3.3) of $\Delta^{n+1}$ and the induction hypothesis. Then
by recombining, we obtain

$$\alpha [\Delta^n f(x + h) - \Delta^n f(x)] + \beta [\Delta^n g(x + h) - \Delta^n g(x)] = \alpha \Delta^{n+1} f(x) + \beta \Delta^{n+1} g(x) \qquad ∎$$

Lemma 2 says that if the derivatives of f(x) are bounded, or if they do not
increase rapidly compared with $h^{-n}$, then the forward differences $\Delta^n f(x)$ should
become smaller as n increases. We next look at the effect of rounding errors and
other errors of a larger magnitude than rounding errors. Let

$$f(x_i) = \tilde{f}_i + e(x_i) \qquad i = 0, 1, 2, \ldots \qquad (3.4.1)$$

with $\tilde{f}_i$ a table value that we use in constructing the forward difference table.
Then

$$\Delta^r \tilde{f}_i = \Delta^r f(x_i) - \Delta^r e(x_i) = h^r f^{(r)}(\xi_i) - \Delta^r e(x_i) \qquad (3.4.2)$$

The first term becomes smaller as r increases, as illustrated in the earlier forward
difference table for $f(x) = \sqrt{x}$.

To better understand the behavior of $\Delta^r e(x_i)$, consider the simple case in
which

$$e(x_i) = \begin{cases} 0 & i \ne k \\ \epsilon & i = k \end{cases} \qquad (3.4.3)$$
Table 3.8 Forward differences of error function e(x)

x_i        e(x_i)    Δe(x_i)    Δ²e(x_i)    Δ³e(x_i)

...        ...       ...        ...         ...
x_{k-2}    0                    0
                     0                      ε
x_{k-1}    0                    ε
                     ε                      -3ε
x_k        ε                    -2ε
                     -ε                     3ε
x_{k+1}    0                    ε
                     0                      -ε
x_{k+2}    0                    0
...        ...       ...        ...         ...


The forward differences of this function are given in Table 3.8. It can be proved
that the column for $\Delta^r e(x_i)$ will look like

$$0, \ldots, 0,\ \epsilon,\ -\binom{r}{1}\epsilon,\ \binom{r}{2}\epsilon,\ -\binom{r}{3}\epsilon,\ \ldots,\ (-1)^r \epsilon,\ 0, \ldots \qquad (3.4.4)$$

Thus the effect of a single rounding error will propagate and increase in value as
larger order differences are formed.
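The binomial pattern (3.4.4) is easy to confirm by differencing a table that is zero except for one entry. The following Python sketch uses integer arithmetic so that the comparison is exact; the helper names `differences` and `single_error_pattern` are assumptions introduced for this illustration.

```python
from math import comb


def differences(values, r):
    """Return the column of r-th order forward differences of values."""
    col = list(values)
    for _ in range(r):
        col = [b - a for a, b in zip(col, col[1:])]
    return col


def single_error_pattern(r, eps=1):
    """Nonzero part of (3.4.4): eps, -C(r,1) eps, C(r,2) eps, ..., (-1)^r eps."""
    return [(-1) ** j * comb(r, j) * eps for j in range(r + 1)]


if __name__ == "__main__":
    k = 6
    e = [0] * 13
    e[k] = 1                              # a single error of size eps = 1 at x_k
    for r in range(1, 5):
        column = differences(e, r)
        # The disturbed entries are Δ^r e(x_{k-r}), ..., Δ^r e(x_k).
        print(r, column[k - r: k + 1], single_error_pattern(r))
```

The largest entry in column r has magnitude $\binom{r}{\lfloor r/2 \rfloor} \epsilon$, which grows quickly with r.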

With rounding errors defining a general error function as in (3.4.1), their effect
can be looked upon as a sum of functions of the form (3.4.3). Since the values of
$e(x_i)$ in general will vary in sign and magnitude, their effects will overlap in a
seemingly random manner. But their differences will still grow in size, and the
higher order differences of table values $\tilde{f}_i$ will eventually become useless. When
differences $\Delta^r \tilde{f}_i$ begin to increase in size with increasing r, then these differences
are most likely dominated by rounding errors and should not be used. An
interpolation formula of degree less than r should be used.


Detecting noise in data This same analysis can be used to detect and correct
isolated errors that are large relative to rounding error. Since (3.4.2) says that the
effect of the errors will eventually dominate, we look for a pattern like (3.4.4).
The general technique is illustrated in Table 3.9.

Table 3.9 Example of detecting an isolated error in data

$\tilde{f}_i$    $\Delta \tilde{f}_i$    $\Delta^2 \tilde{f}_i$    $\Delta^3 \tilde{f}_i$    Error guess    Guess $\Delta^3 f(x_i)$

.10396
          .01700
.12096              -.00014
          .01686              -.00003     0              -.00003
.13782              -.00017
          .01669              -.00002     0              -.00002
.15451              -.00019
          .01650               .00006     ε              -.00002
.17101              -.00013
          .01637              -.00025     -3ε            -.00002
.18738              -.00038
          .01599               .00021     3ε             -.00002
.20337              -.00017
          .01582              -.00010     -ε             -.00002
.21919              -.00027
          .01555
.23474

From (3.4.2)

$$\Delta^r e(x_i) = \Delta^r f(x_i) - \Delta^r \tilde{f}_i$$

Using r = 3 and one of the error entries chosen arbitrarily, say the first,

$$\epsilon = -.00002 - (.00006) = -.00008$$

Try this to see how it will alter the column of $\Delta^3 \tilde{f}_i$ (see Table 3.10). This will not
be improved on by another choice of $\epsilon$, say $\epsilon = -.00007$, although the results
may be equally good.

Table 3.10 Correcting a data error

$\Delta^3 \tilde{f}_i$    $\Delta^3 e(x_i)$    $\Delta^3 f(x_i)$
-.00002     0.0         -.00002
 .00006    -.00008      -.00002
-.00025     .00024      -.00001
 .00021    -.00024      -.00003
-.00010     .00008      -.00002

Tracing backwards, the entry $\tilde{f}_i = .18738$ should be

$$f(x_i) = \tilde{f}_i + e(x_i) = .18738 + (-.00008) = .18730$$

In a table in which there are two or three isolated errors, their higher order
differences may overlap, making it more difficult to discover the errors (see
Problem 22).
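The correction just carried out by hand can be replayed in a few lines. The Python sketch below is an illustration under stated assumptions: the table entries are stored as integers in units of $10^{-5}$ to avoid floating-point noise, and the position of the first disturbed third difference and the guessed smooth value -.00002 are supplied by the user, exactly as they were chosen by inspection above. The names `differences`, `correct_isolated_error`, and `TABLE` are introduced here.

```python
def differences(values, r):
    """Return the column of r-th order forward differences of values."""
    col = list(values)
    for _ in range(r):
        col = [b - a for a, b in zip(col, col[1:])]
    return col


def correct_isolated_error(values, start, smooth_guess, r=3):
    """Correct one isolated error, given the index `start` of the first
    disturbed r-th difference and a guess for its undisturbed value.

    From (3.4.2), eps = (guessed Δ^r f) - (tabulated Δ^r f~) at that index,
    and by (3.4.4) the erroneous entry is r places further down the table.
    """
    d = differences(values, r)
    eps = smooth_guess - d[start]
    k = start + r
    corrected = list(values)
    corrected[k] += eps                # f(x_k) = f~_k + e(x_k)
    return k, eps, corrected


# Table entries in units of 1e-5 (so 10396 stands for .10396).
TABLE = [10396, 12096, 13782, 15451, 17101, 18738, 20337, 21919, 23474]

if __name__ == "__main__":
    print("third differences:", differences(TABLE, 3))
    k, eps, fixed = correct_isolated_error(TABLE, start=2, smooth_guess=-2)
    print("entry", k, "changed by", eps, "to", fixed[k])
    print("third differences after correction:", differences(fixed, 3))
```

After the correction the third differences are all of size .00001 to .00003, which is the behavior expected from rounding error alone.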


3.5 Further Results on Interpolation Error 


Consider again the error formula

$$f(x) - p_n(x) = \frac{(x - x_0)(x - x_1) \cdots (x - x_n)}{(n+1)!}\, f^{(n+1)}(\xi_x) \qquad (3.5.1)$$

We assume that f(x) is n + 1 times continuously differentiable on an interval I
that contains $\mathscr{H}\{x_0, \ldots, x_n, x\}$ for all values x of interest. Since $\xi_x$ is unknown,
we must replace $f^{(n+1)}(\xi_x)$ by

$$c_{n+1} = \max_{t \in I} |f^{(n+1)}(t)| \qquad (3.5.2)$$

in order to evaluate (3.5.1). We will concentrate our attention on bounding the
polynomial

$$\Psi_n(x) = (x - x_0) \cdots (x - x_n) \qquad (3.5.3)$$

Then from (3.5.1),

$$\max_{x \in I} |f(x) - p_n(x)| \le \frac{c_{n+1}}{(n+1)!} \max_{x \in I} |\Psi_n(x)| \qquad (3.5.4)$$

We will consider only the use of evenly spaced nodes: $x_j = x_0 + jh$ for
j = 0, 1, ..., n. We first consider cases of specific values of n, and later we
comment on the case of general n.

Case 1 n = 1. $\Psi_1(x) = (x - x_0)(x - x_1)$. Then easily

$$\max_{x_0 \le x \le x_1} |\Psi_1(x)| = \frac{h^2}{4}$$

See Figure 3.2 for an illustration.

Figure 3.2 $y = \Psi_1(x)$.


Case 2 n = 2. To bound $\Psi_2(x)$ on $[x_0, x_2]$, shift it along the x-axis to obtain
an equivalent polynomial

$$\tilde{\Psi}_2(x) = (x + h)\, x\, (x - h)$$

whose graph is shown in Figure 3.3. Its shape and size are exactly the same as
with the original polynomial $\Psi_2(x)$, but it is easier to bound analytically. Using
$\tilde{\Psi}_2(x)$, we easily obtain

Figure 3.3 $y = \tilde{\Psi}_2(x)$.

$$\max_{x_1 - h/2 \le x \le x_1 + h/2} |\Psi_2(x)| = .375 h^3$$

$$\max_{x_0 \le x \le x_2} |\Psi_2(x)| = \frac{2\sqrt{3}}{9} h^3 \approx .385 h^3 \qquad (3.5.5)$$

Thus it doesn't matter if x is located near $x_1$ in the interval $[x_0, x_2]$, although
it will make a difference for higher degree interpolation. Combining (3.5.5) with
(3.5.4),

$$\max_{x_0 \le x \le x_2} |f(x) - p_2(x)| \le \frac{\sqrt{3}\, h^3}{27} \max_{x_0 \le t \le x_2} |f^{(3)}(t)| \qquad (3.5.6)$$

and $\sqrt{3}/27 \approx .064$.


Case 3 n = 3. As previously, shift the polynomial to make the nodes symmetric
about the origin, obtaining

$$\tilde{\Psi}_3(x) = \left(x^2 - \tfrac{9}{4} h^2\right)\left(x^2 - \tfrac{1}{4} h^2\right)$$

The graph of $\tilde{\Psi}_3(x)$ is shown in Figure 3.4. Using this modification,

$$\max_{x_1 \le x \le x_2} |\Psi_3(x)| = \tfrac{9}{16} h^4 \approx 0.56 h^4$$

$$\max_{x_0 \le x \le x_3} |\Psi_3(x)| = h^4$$

Thus in interpolation to f(x) at x, the nodes should be so chosen that
$x_1 < x < x_2$. Then from (3.5.1),

$$\max_{x_1 \le x \le x_2} |f(x) - p_3(x)| \le \frac{3 h^4}{128} \max_{x_0 \le t \le x_3} |f^{(4)}(t)| \qquad (3.5.7)$$

Figure 3.5 $y = \Psi_6(x)$.


Case 4 For a general n > 3, the behavior just exhibited with n = 3 is
accentuated. For example, consider the graph of $\Psi_6(x)$ in Figure 3.5. As earlier,
we can show

$$\max_{x_2 \le x \le x_4} |\Psi_6(x)| \approx 12.36 h^7$$

$$\max_{x_0 \le x \le x_6} |\Psi_6(x)| \approx 95.8 h^7$$

To minimize the interpolation error, the interpolation nodes should be chosen so
that the interpolation point x is as near as possible to the midpoint of $[x_0, x_n]$.
As the degree n increases, for interpolation on evenly spaced nodes, it becomes
increasingly advisable to choose the nodes so that x is near the middle of
$[x_0, x_n]$.
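The constants quoted in Cases 1 through 4 can be checked numerically by sampling $|\Psi_n(x)|$ on a fine grid. The sketch below takes h = 1 and nodes $x_j = j$, so the results are the coefficients of $h^{n+1}$; the function names `psi` and `max_abs_psi` and the sample count are assumptions made for this illustration, and the maxima are only as accurate as the grid allows.

```python
def psi(x, n):
    """Psi_n(x) = (x - x_0) ... (x - x_n) for the nodes x_j = j (that is, h = 1)."""
    p = 1.0
    for j in range(n + 1):
        p *= x - j
    return p


def max_abs_psi(n, a, b, samples=20001):
    """Approximate the maximum of |Psi_n(x)| on [a, b] by sampling a uniform grid."""
    return max(abs(psi(a + (b - a) * i / (samples - 1), n)) for i in range(samples))


if __name__ == "__main__":
    print("n=1 on [x0,x1]:", round(max_abs_psi(1, 0, 1), 4))   # h^2 / 4
    print("n=2 on [x0,x2]:", round(max_abs_psi(2, 0, 2), 4))   # 2 sqrt(3) / 9
    print("n=3 on [x1,x2]:", round(max_abs_psi(3, 1, 2), 4))   # 9 / 16
    print("n=3 on [x0,x3]:", round(max_abs_psi(3, 0, 3), 4))   # 1
    print("n=6 on [x2,x4]:", round(max_abs_psi(6, 2, 4), 2))
    print("n=6 on [x0,x6]:", round(max_abs_psi(6, 0, 6), 2))
```

For n = 6 the ratio between the maximum over the whole interval and the maximum over the two middle subintervals is nearly 8, which quantifies the advice to keep x near the middle of the nodes.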


Example Consider interpolation of degree five to $J_0(x)$ at x = 2.45, with the
values of $J_0(x)$ given in Table 3.4 in Section 3.2. Using first $x_0 = 2.4$, $x_5 = 2.9$,
we obtain

$$p_5(2.45) = -0.0232267384 \qquad \text{Error} = -4.9 \times 10^{-9}$$

Second, use $x_0 = 2.2$, $x_5 = 2.7$. Then

$$p_5(2.45) = -0.0232267423 \qquad \text{Error} = 1.0 \times 10^{-9}$$

The error is about five times smaller, due to positioning x near the middle of
$[x_0, x_5]$.


The tables in Abramowitz and Stegun (1964) are given to many significant
digits, and the grid spacing h is not correspondingly small. Consequently,
high-degree interpolation must be used. Although this results in more work for
the user of the tables, it allows the table to be compacted into a much smaller
space; more tables of more functions can then be included in the volume. When
interpolating with these tables and using high-degree interpolation, the nodes
should be so chosen that x is near $(x_0 + x_n)/2$, if possible.


The approximation problem In using a computer, we generally prefer to store
an analytic approximation to a function rather than a table of values from which
we interpolate. Consider approximating a given function f(x) on a given interval
[a, b] by using interpolating polynomials. In particular, consider the polynomial
$p_n(x)$ produced by interpolating f(x) on an evenly spaced subdivision of
[a, b].

For each n ≥ 1, define h = (b - a)/n, $x_j = a + jh$, j = 0, 1, ..., n. Let
$p_n(x)$ be the polynomial interpolating to f(x) at $x_0, \dots, x_n$. Then we
ask whether

$$\max_{a \le x \le b} |f(x) - p_n(x)| \tag{3.5.8}$$

tends to zero as n → ∞. The answer is: not necessarily. For many functions, for
example, $e^x$ on [0, 1], the error in (3.5.8) does converge to zero as n → ∞
(see Problem 24). But there are other functions, quite well behaved, for which
convergence does not occur.

The most famous example of failure to converge is one due to Carl Runge. Let

$$f(x) = \frac{1}{1 + x^2} \qquad -5 \le x \le 5 \tag{3.5.9}$$

In Isaacson and Keller (1966, pp. 275-279), it is shown that for any
3.64 < |x| < 5,

$$\operatorname*{Supremum}_{n \ge k} |f(x) - p_n(x)| = \infty \qquad k \ge 0 \tag{3.5.10}$$

Figure 3.6 Interpolation to $1/(1 + x^2)$.

Thus $p_n(x)$ does not converge to f(x) as n → ∞ for any such x. This is at
first counterintuitive, but it is based on the behavior of the polynomials
$y = \Psi_n(x)$ near the endpoints of $[a, b] = [x_0, x_n]$. This is further
illustrated with the graphs of f(x) and $p_{10}(x)$, given in Figure 3.6.
Although interpolation on an even grid may not produce a convergent sequence of
interpolation polynomials, there are suitable sets of grid points $\{x_j\}$
that do result in good approximations for all continuously differentiable
functions. This grid is developed in Section 4.7 of Chapter 4, which is on
approximation theory.
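The contrast between $e^x$ on [0, 1] and Runge's function on [-5, 5] can be observed with a short experiment. This is a minimal sketch, assuming NumPy; the helper names (`newton_coeffs`, `newton_eval`, `max_interp_error`, `runge`) are hypothetical, and the maximum is only sampled on a fine grid rather than computed exactly.

```python
import numpy as np

def newton_coeffs(x, y):
    """Divided differences f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n]."""
    c = np.array(y, dtype=float)
    n = len(x)
    for k in range(1, n):
        # after this step c[j] holds f[x_{j-k}, ..., x_j] for j >= k
        c[k:] = (c[k:] - c[k - 1:-1]) / (x[k:] - x[:-k])
    return c

def newton_eval(x, c, t):
    """Evaluate the Newton form by nested multiplication."""
    p = np.full_like(t, c[-1], dtype=float)
    for k in range(len(c) - 2, -1, -1):
        p = p * (t - x[k]) + c[k]
    return p

def max_interp_error(f, a, b, n, samples=2001):
    """Sampled max |f - p_n| for evenly spaced nodes on [a, b]."""
    x = np.linspace(a, b, n + 1)
    c = newton_coeffs(x, f(x))
    t = np.linspace(a, b, samples)
    return float(np.max(np.abs(f(t) - newton_eval(x, c, t))))

def runge(x):
    return 1.0 / (1.0 + x**2)

for n in (2, 4, 8, 10, 16, 20):
    print(n, max_interp_error(np.exp, 0.0, 1.0, n),
          max_interp_error(runge, -5.0, 5.0, n))
```

In this experiment the sampled error for $e^x$ shrinks rapidly, while the error for Runge's function grows with n, with the largest deviations near the endpoints.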


3.6 Hermite Interpolation 


For a variety of applications, it is convenient to consider polynomials p(x) that 
interpolate a function f(x), and in addition have the derivative polynomial p’(x) 
interpolate the derivative function f’(x). In this text, the primary application of 
this is as a mathematical tool in the development of Gaussian numerical 
integration in Chapter 5. But such interpolation is also convenient in developing 
numerical methods for solving some differential equation problems. 

We begin by considering an existence theorem for the basic interpolation
problem

$$p(x_i) = y_i \qquad p'(x_i) = y_i' \qquad i = 1, \dots, n \tag{3.6.1}$$

in which $x_1, \dots, x_n$ are distinct nodes (real or complex) and
$y_1, \dots, y_n, y_1', \dots, y_n'$ are given data. (The notation has been
changed from n + 1 nodes $\{x_0, \dots, x_n\}$ to n nodes $\{x_1, \dots, x_n\}$,
in line with the eventual application in Chapter 5.) There are 2n conditions
imposed in (3.6.1); thus we look for a polynomial p(x) of at most degree 2n - 1.

To deal with the existence and uniqueness for p(x), we generalize the third
proof of Theorem 3.1. In line with previous notation in Section 3.1, let

$$\Psi_n(x) = (x - x_1) \cdots (x - x_n)$$

$$l_i(x) = \frac{(x - x_1) \cdots (x - x_{i-1})(x - x_{i+1}) \cdots (x - x_n)}{(x_i - x_1) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_n)} = \frac{\Psi_n(x)}{(x - x_i)\Psi_n'(x_i)}$$

$$\tilde{h}_i(x) = (x - x_i)[l_i(x)]^2$$

$$h_i(x) = [1 - 2l_i'(x_i)(x - x_i)][l_i(x)]^2 \tag{3.6.2}$$

Then for i, j = 1, ..., n,

$$h_i'(x_j) = \tilde{h}_i(x_j) = 0 \qquad 1 \le i, j \le n$$

$$h_i(x_j) = \tilde{h}_i'(x_j) = \begin{cases} 0 & i \ne j \\ 1 & i = j \end{cases} \tag{3.6.3}$$

The interpolating polynomial for (3.6.1) is given by

$$H_n(x) = \sum_{i=1}^{n} y_i h_i(x) + \sum_{i=1}^{n} y_i' \tilde{h}_i(x) \tag{3.6.4}$$

To show the uniqueness of $H_n(x)$, suppose there is a second polynomial G(x)
that satisfies (3.6.1) with degree ≤ 2n - 1. Define R = H - G. Then from (3.6.1)

$$R(x_i) = R'(x_i) = 0 \qquad i = 1, 2, \dots, n$$

where R is a polynomial of degree ≤ 2n - 1, with n double roots,
$x_1, x_2, \dots, x_n$. This can be true only if

$$R(x) = q(x)(x - x_1)^2 \cdots (x - x_n)^2$$

for some polynomial q(x). If q(x) ≢ 0, then degree R(x) ≥ 2n, a contradiction.
Therefore we must have R(x) ≡ 0.

To obtain a more computable form than (3.6.4) and an error term, first
consider the polynomial interpolating f(x) at nodes $z_1, z_2, \dots, z_{2n}$,
written in the Newton divided difference form:

$$p_{2n-1}(x) = f(z_1) + (x - z_1) f[z_1, z_2] + \cdots + (x - z_1) \cdots (x - z_{2n-1}) f[z_1, \dots, z_{2n}] \tag{3.6.5}$$

For the error,

$$f(x) - p_{2n-1}(x) = (x - z_1) \cdots (x - z_{2n}) f[z_1, \dots, z_{2n}, x] \tag{3.6.6}$$

In the formula (3.6.5), we can let nodes coincide and the formula will still
exist. In particular, let

$$z_1, z_2 \to x_1, \quad z_3, z_4 \to x_2, \quad \dots, \quad z_{2n-1}, z_{2n} \to x_n$$

to obtain

$$p_{2n-1}(x) = f(x_1) + (x - x_1) f[x_1, x_1] + (x - x_1)^2 f[x_1, x_1, x_2] + \cdots + (x - x_1)^2 \cdots (x - x_{n-1})^2 (x - x_n) f[x_1, x_1, \dots, x_n, x_n] \tag{3.6.7}$$

This is a polynomial of degree ≤ 2n - 1. For its error, take limits in (3.6.6)
as $z_1, z_2 \to x_1, \dots, z_{2n-1}, z_{2n} \to x_n$. By the continuity of the
divided difference, assuming f is sufficiently differentiable,

$$f(x) - p_{2n-1}(x) = (x - x_1)^2 \cdots (x - x_n)^2 f[x_1, x_1, \dots, x_n, x_n, x] \tag{3.6.8}$$

Claim: $p_{2n-1}(x) = H_n(x)$. To prove this, assume f(x) is 2n + 1 times
continuously differentiable. Then note that

$$f(x_i) - p_{2n-1}(x_i) = 0 \qquad i = 1, 2, \dots, n$$

Also,

$$f'(x) - p_{2n-1}'(x) = (x - x_1)^2 \cdots (x - x_n)^2 \frac{d}{dx} f[x_1, x_1, \dots, x_n, x_n, x] + 2 f[x_1, x_1, \dots, x_n, x_n, x] \sum_{i=1}^{n} \Big[ (x - x_i) \prod_{\substack{j=1 \\ j \ne i}}^{n} (x - x_j)^2 \Big]$$

and

$$f'(x_i) - p_{2n-1}'(x_i) = 0 \qquad i = 1, \dots, n$$

Thus degree $(p_{2n-1})$ ≤ 2n - 1 and it satisfies (3.6.1) relative to the data
$y_i = f(x_i)$, $y_i' = f'(x_i)$. By the uniqueness of the Hermite interpolating
polynomial, we have $p_{2n-1} = H_n$. Thus (3.6.7) gives a divided difference
formula for calculating $H_n(x)$, and (3.6.8) gives an error formula:

$$f(x) - H_n(x) = [\Psi_n(x)]^2 f[x_1, x_1, \dots, x_n, x_n, x] \tag{3.6.9}$$

Using (3.2.13) to generalize (3.2.12), we obtain

$$f(x) - H_n(x) = [\Psi_n(x)]^2 \frac{f^{(2n)}(\xi_x)}{(2n)!} \qquad \xi_x \in \mathcal{H}\{x_1, \dots, x_n, x\} \tag{3.6.10}$$
Example The most widely used form of Hermite interpolation is probably the
cubic Hermite polynomial, which solves

$$p(a) = f(a) \qquad p'(a) = f'(a)$$
$$p(b) = f(b) \qquad p'(b) = f'(b) \tag{3.6.11}$$

The formula (3.6.4) becomes

$$H_2(x) = \left[1 + 2\frac{x - a}{b - a}\right] \left[\frac{b - x}{b - a}\right]^2 f(a) + \left[1 + 2\frac{b - x}{b - a}\right] \left[\frac{x - a}{b - a}\right]^2 f(b) + \frac{(x - a)(b - x)^2}{(b - a)^2} f'(a) - \frac{(x - a)^2 (b - x)}{(b - a)^2} f'(b) \tag{3.6.12}$$

The divided difference formula (3.6.7) becomes

$$H_2(x) = f(a) + (x - a) f'(a) + (x - a)^2 f[a, a, b] + (x - a)^2 (x - b) f[a, a, b, b] \tag{3.6.13}$$

in which

$$f[a, a, b] = \frac{f[a, b] - f'(a)}{b - a}$$

$$f[a, a, b, b] = \frac{f'(b) - 2f[a, b] + f'(a)}{(b - a)^2}$$

The formula (3.6.13) can be evaluated by a nested multiplication algorithm
analogous to Interp in Section 3.2.

The error formula for (3.6.12) or (3.6.13) is

$$f(x) - H_2(x) = (x - a)^2 (x - b)^2 f[a, a, b, b, x] = \frac{(x - a)^2 (x - b)^2}{24} f^{(4)}(\xi_x) \qquad \xi_x \in \mathcal{H}\{a, b, x\} \tag{3.6.14}$$

$$\max_{a \le x \le b} |f(x) - H_2(x)| \le \frac{(b - a)^4}{384} \max_{a \le t \le b} |f^{(4)}(t)| \tag{3.6.15}$$

Further use will be made of the cubic Hermite polynomial in the next section,
and a numerical example is given in Table 3.12 of that section.
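The divided difference form (3.6.13) lends itself to nested multiplication. The following is a minimal sketch in plain Python; the function names `cubic_hermite` and `max_error` are hypothetical, and the routine is not the text's Interp algorithm, only an analogue written for this one formula.

```python
import math

def cubic_hermite(a, b, fa, fb, dfa, dfb):
    """Return H_2 as a function, built from f(a), f(b), f'(a), f'(b)."""
    h = b - a
    f_ab = (fb - fa) / h                      # f[a, b]
    f_aab = (f_ab - dfa) / h                  # f[a, a, b]
    f_aabb = (dfb - 2.0 * f_ab + dfa) / h**2  # f[a, a, b, b]

    def H2(x):
        # nested form of (3.6.13)
        return fa + (x - a) * (dfa + (x - a) * (f_aab + (x - b) * f_aabb))

    return H2

def max_error(f, H, a, b, m=1000):
    """Sampled max |f - H| on [a, b] (an approximation to the true maximum)."""
    return max(abs(f(a + (b - a) * k / m) - H(a + (b - a) * k / m))
               for k in range(m + 1))

# Example: f(x) = sin(x) on [0, 1]
H = cubic_hermite(0.0, 1.0, math.sin(0.0), math.sin(1.0),
                  math.cos(0.0), math.cos(1.0))
print(max_error(math.sin, H, 0.0, 1.0))
```

For sin(x) on [0, 1] the fourth derivative is bounded by 1, so (3.6.15) bounds the error by 1/384; the sampled error can be compared against that value.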
The general Hermite interpolation problem We generalize the simple Hermite
problem (3.6.1) to the following: Find a polynomial p(x) to satisfy

$$p^{(i)}(x_1) = y_1^{(i)} \qquad i = 0, 1, \dots, \alpha_1 - 1$$
$$\vdots$$
$$p^{(i)}(x_n) = y_n^{(i)} \qquad i = 0, 1, \dots, \alpha_n - 1 \tag{3.6.16}$$

The numbers $y_j^{(i)}$ are given data, and the number of conditions on p(x) at
$x_j$ is $\alpha_j$, j = 1, ..., n. Define

$$N = \alpha_1 + \cdots + \alpha_n$$

Then there is a polynomial p(x), unique among those of degree ≤ N - 1, which
satisfies (3.6.16). The proof is left as Problem 25. All of the earlier results,
such as (3.6.4) and (3.6.9), can be generalized. As an interesting special case,
consider $\alpha_1 = N$, n = 1. This means that p(x) is to satisfy

$$p^{(i)}(x_1) = f^{(i)}(x_1) \qquad i = 0, 1, \dots, N - 1$$

We have replaced $y_1^{(i)}$ by $f^{(i)}(x_1)$ for notational convenience. Then
using (3.2.15), the Newton divided difference form of the Hermite interpolating
polynomial is

$$p(x) = f(x_1) + (x - x_1) f[x_1, x_1] + (x - x_1)^2 f[x_1, x_1, x_1] + \cdots + (x - x_1)^{N-1} f[x_1, \dots, x_1]$$

$$= f(x_1) + (x - x_1) f'(x_1) + \frac{(x - x_1)^2}{2!} f''(x_1) + \cdots + \frac{(x - x_1)^{N-1}}{(N-1)!} f^{(N-1)}(x_1)$$

which is also the Taylor polynomial of f about $x_1$.


3.7 Piecewise Polynomial Interpolation 


Since the early 1960s, the subject of piecewise polynomial functions has become
increasingly popular, especially spline functions. These polynomial functions
have been used in a large variety of ways in approximation theory, computer
graphics, data fitting, numerical integration and differentiation, and the
numerical solution of integral, differential, and partial differential
equations. We look at piecewise polynomial functions from only the viewpoint of
interpolation theory, but much of their useful application occurs in some other
area, with interpolation occurring only in a peripheral way.

For a piecewise polynomial function p(x), there is an associated grid:

$$-\infty < x_0 < x_1 < \cdots < x_n < \infty \tag{3.7.1}$$

The points $x_i$ are sometimes called knots, breakpoints, or nodes. The function
p(x) is a polynomial on each of the subintervals

$$(-\infty, x_0], \; [x_0, x_1], \; \dots, \; [x_n, \infty) \tag{3.7.2}$$

although often the intervals $(-\infty, x_0]$ and $[x_n, \infty)$ are not
included. We say p(x) is a piecewise polynomial of order r if the degree of p(x)
is less than r on each of the subintervals in (3.7.2). No restrictions of
continuity need be placed on p(x) or its derivatives, although usually p(x) is
continuous. In this section, we mostly restrict ourselves to piecewise cubic
polynomial functions (of order four). This is the most common case in
applications, and it simplifies the presentation to be definite as to order.

One way of classifying piecewise polynomial interpolation problems is as local
or global. For the local type of problem, the polynomial p(x) on each subinterval
$[x_{i-1}, x_i]$ is completely determined by the interpolation data at node
points inside and neighboring $[x_{i-1}, x_i]$. But for a global problem, the
choice of p(x) on each $[x_{i-1}, x_i]$ is dependent on all of the interpolation
data. The global problems are somewhat more complicated to study; the best known
examples are the spline functions, to be defined later.


Local interpolation problems Consider that we want to approximate f(x) on an
interval [a, b]. We begin by choosing a grid

$$a = x_0 < x_1 < \cdots < x_n = b \tag{3.7.3}$$

often evenly spaced. Our first case of a piecewise polynomial interpolation
function is based on using ordinary polynomial interpolation on each subinterval
$[x_{i-1}, x_i]$.

Let four interpolation nodes be given on each subinterval $[x_{i-1}, x_i]$,

$$x_{i-1} \le z_{i,1} < z_{i,2} < z_{i,3} < z_{i,4} \le x_i \qquad i = 1, \dots, n \tag{3.7.4}$$

Define p(x) on $x_{i-1} \le x \le x_i$ by letting it be the polynomial of degree
≤ 3 that interpolates f(x) at $z_{i,1}, \dots, z_{i,4}$. If

$$z_{i,1} = x_{i-1} \qquad z_{i,4} = x_i \qquad i = 1, 2, \dots, n \tag{3.7.5}$$

then p(x) is continuous on [a, b]. For the nodes in (3.7.4), we call this
interpolating function the Lagrange piecewise polynomial function, and it will be
denoted by $L_n(x)$.

From the error formula (3.1.8) for polynomial interpolation, the error formula
for $L_n(x)$ satisfies

$$f(x) - L_n(x) = \frac{(x - z_{i,1})(x - z_{i,2})(x - z_{i,3})(x - z_{i,4})}{4!} f^{(4)}(\xi_i) \tag{3.7.6}$$

for $x_{i-1} \le x \le x_i$, i = 1, ..., n, with $x_{i-1} \le \xi_i \le x_i$.
Consider the special case of even spacing. Let

$$\delta_i = \frac{x_i - x_{i-1}}{3} \qquad z_{i,j} = x_{i-1} + (j - 1)\delta_i \qquad j = 1, 2, 3, 4 \tag{3.7.7}$$

Then (3.5.7) and (3.7.6) yield

$$|f(x) - L_n(x)| \le \frac{\delta_i^4}{24} \max_{x_{i-1} \le t \le x_i} |f^{(4)}(t)| \qquad x_{i-1} \le x \le x_i \tag{3.7.8}$$

for i = 1, 2, ..., n. From this, we can see that to maintain an equal level of
error throughout a ≤ x ≤ b, the spacing $\delta_i$ should be chosen based on the
size of the derivative $f^{(4)}(t)$ at points of $[x_{i-1}, x_i]$. Thus if the
function f(x) has varying behavior in the interval [a, b], the piecewise
polynomial function $L_n(x)$ can be chosen to mimic this behavior by suitably
adjusting the grid in (3.7.3). Ordinary polynomial interpolation on [a, b] is
not this flexible, which is one reason for using piecewise polynomial
interpolation. For cases where we use an evenly spaced grid (3.7.3), the result
(3.7.8) guarantees convergence where ordinary polynomial interpolation may fail,
as with Runge's example (3.5.9).


Example For $f(x) = e^x$ on [0, 1], suppose we want even spacing and a maximum
error less than $10^{-6}$. Using $\delta_i = \delta$ in (3.7.8), we require that

$$\frac{\delta^4}{24} e \le 10^{-6}$$

$$\delta \le .055 \qquad n = \frac{1}{3\delta} \ge 6.12$$

It will probably be sufficient to use n = 6 because of the conservative nature
of the bound (3.7.8). There will be six subintervals of cubic interpolation.
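The arithmetic in this example, and the behavior of the interpolant with n = 6, can be tried out directly. The following is a minimal sketch in plain Python, assuming the evenly spaced nodes of (3.7.7); the names `min_subintervals`, `lagrange_cubic`, and `piecewise_lagrange` are hypothetical, and the maximum error is only sampled.

```python
import math

def min_subintervals(tol):
    """Smallest (real-valued) n with delta**4/24 * e <= tol, n = 1/(3*delta)."""
    delta = (24.0 * tol / math.e) ** 0.25
    return 1.0 / (3.0 * delta)

def lagrange_cubic(z, fz, x):
    """Cubic through (z[j], fz[j]), j = 0..3, in Lagrange form."""
    total = 0.0
    for j in range(4):
        term = fz[j]
        for k in range(4):
            if k != j:
                term *= (x - z[k]) / (z[j] - z[k])
        total += term
    return total

def piecewise_lagrange(f, a, b, n):
    """L_n(x) on an even grid, with four evenly spaced nodes per subinterval."""
    h = (b - a) / n
    delta = h / 3.0

    def L(x):
        i = min(max(int((x - a) / h), 0), n - 1)   # subinterval index
        z = [a + i * h + j * delta for j in range(4)]
        return lagrange_cubic(z, [f(t) for t in z], x)

    return L

print(min_subintervals(1e-6))
L6 = piecewise_lagrange(math.exp, 0.0, 1.0, 6)
err = max(abs(math.exp(k / 2000) - L6(k / 2000)) for k in range(2001))
print(err, (1.0 / 18) ** 4 / 24 * math.e)   # sampled error and the bound (3.7.8)
```

The second printed line compares the sampled error with the bound from (3.7.8) for δ = 1/18; the bound itself is slightly above the target because n was rounded down from 6.12 to 6.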


For the storage requirements of Lagrange piecewise polynomial interpolation,
assuming (3.7.7), we need to save four pieces of information on each subinterval
$[x_{i-1}, x_i]$. This gives a total storage requirement of 4n memory locations
for $L_n(x)$, as well as an additional n - 1 locations for the breakpoints of
(3.7.3). The choice of information to be saved depends on how $L_n(x)$ is to be
evaluated. If derivatives of $L_n(x)$ are desired, then it is most convenient to
store $L_n(x)$ in its Taylor polynomial form on each subinterval
$[x_{i-1}, x_i]$:

$$L_n(x) = a_i + b_i(x - x_{i-1}) + c_i(x - x_{i-1})^2 + d_i(x - x_{i-1})^3 \qquad x_{i-1} \le x \le x_i \tag{3.7.9}$$

This should be evaluated in nested form. The coefficients
$\{a_i, b_i, c_i, d_i\}$ are easily produced from the standard forms of the
cubic interpolation polynomial.
A second widely used piecewise polynomial interpolation function is based on
the cubic Hermite interpolation polynomial of (3.6.12)-(3.6.15). On each
subinterval $[x_{i-1}, x_i]$, let $Q_n(x)$ be the cubic Hermite polynomial
interpolating f(x) at $x_{i-1}$ and $x_i$. The function $Q_n(x)$ is piecewise
cubic on the grid (3.7.3), and because of the interpolation conditions, both
$Q_n(x)$ and $Q_n'(x)$ are continuous on [a, b]. Thus $Q_n(x)$ will generally be
smoother than $L_n(x)$.

For the error in $Q_n(x)$ on $[x_{i-1}, x_i]$, use (3.6.15) to get

$$|f(x) - Q_n(x)| \le \frac{h_i^4}{384} \max_{x_{i-1} \le t \le x_i} |f^{(4)}(t)| \qquad x_{i-1} \le x \le x_i \tag{3.7.10}$$

with $h_i = x_i - x_{i-1}$, 1 ≤ i ≤ n. When this is compared to (3.7.8) for the
error in $L_n(x)$, it might seem that piecewise cubic Hermite interpolation is
superior. This is deceptive. To see this more clearly, let the grid (3.7.3) be
evenly spaced, with $x_i - x_{i-1} = h$, all i. Let $L_n(x)$ be based on
(3.7.7), and let δ = h/3 in (3.7.8). Note that $Q_n(x)$ is based on 2n + 2
pieces of data about f(x), namely $\{f(x_i), f'(x_i) \mid i = 0, 1, \dots, n\}$,
and $L_n(x)$ is based on 3n + 1 pieces of data about f(x). Equalize this by
comparing the error for $L_{n_1}(x)$ and $Q_{n_2}(x)$ with $n_2 = 1.5 n_1$. Then
the resultant error bounds from (3.7.8) and (3.7.10) will be exactly the same.
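The equality of the two bounds can be verified in one line. Writing $h_1 = (b - a)/n_1$ for the spacing used by $L_{n_1}$ and $h_2 = (b - a)/n_2 = h_1/1.5$ for the spacing used by $Q_{n_2}$ (the symbols $h_1$ and $h_2$ are introduced here only for this check), the constants multiplying the maximum of $|f^{(4)}|$ are:

```latex
\frac{\delta^4}{24} = \frac{(h_1/3)^4}{24} = \frac{h_1^4}{1944},
\qquad
\frac{h_2^4}{384} = \frac{(h_1/1.5)^4}{384} = \frac{h_1^4}{5.0625 \cdot 384} = \frac{h_1^4}{1944}
```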

Since there is no difference in error, the form of piecewise polynomial function 
used will depend on the application for which it is to be used. In numerical 
integration applications, the piecewise Lagrange function is most suitable; it is 
also used in solving some singular integral equations, by means of the product 
integration methods of Section 5.6 in Chapter 5. The piecewise Hermite function 
is useful for solving some differential equation problems. For example, it is a 
popular function used with the finite element method for solving boundary value 
problems for second-order differential equations; see Strang and Fix (1973, chap.
1). Numerical examples comparing $L_n(x)$ and $Q_n(x)$ are given in Tables 3.11
and 3.12, following the introduction of spline functions.


Spline functions As before, consider a grid

$$a = x_0 < x_1 < \cdots < x_n = b$$

We say s(x) is a spline function of order m ≥ 1 if it satisfies the following two
properties:

(P1) s(x) is a polynomial of degree < m on each subinterval $[x_{i-1}, x_i]$.
(P2) $s^{(r)}(x)$ is continuous on [a, b], for 0 ≤ r ≤ m - 2.

The derivative of a spline of order m is a spline of order m - 1, and similarly
for antiderivatives. If the continuity in P2 is extended to $s^{(m-1)}(x)$, then
it can be proved that s(x) is a polynomial of degree ≤ m - 1 on [a, b] (see
Problem 33).

Cubic spline functions (order m = 4) are the most popular spline functions,
for a variety of reasons. They are smooth functions with which to fit data, and
when used for interpolation, they do not have the oscillatory behavior that is
characteristic of high-degree polynomial interpolation. Some further motivation
for particular forms of cubic spline interpolation is given in (3.7.27) and in
Problem 38.

For our interpolation problem, we wish to find a cubic spline s(x) for which

$$s(x_i) = y_i \qquad i = 0, 1, \dots, n \tag{3.7.11}$$

We begin by investigating how many degrees of freedom are left in the choice of
s(x), once it satisfies (3.7.11). The technique used does not lead directly to a
practical means for calculating s(x), but it does furnish additional insight.
Write

$$s(x) = a_i + b_i x + c_i x^2 + d_i x^3 \qquad x_{i-1} \le x \le x_i \qquad i = 1, \dots, n \tag{3.7.12}$$

There are 4n unknown coefficients $\{a_i, b_i, c_i, d_i\}$. The constraints on
s(x) are (3.7.11) and the continuity restrictions from P2,

$$s^{(j)}(x_i + 0) = s^{(j)}(x_i - 0) \qquad i = 1, \dots, n - 1 \qquad j = 0, 1, 2 \tag{3.7.13}$$

Together this gives

$$n + 1 + 3(n - 1) = 4n - 2$$

constraints, as compared with 4n unknowns. Thus there are at least two degrees
of freedom in choosing the coefficients of (3.7.12). We should expect to impose
extra conditions on s(x) in order to obtain a unique interpolating spline s(x).
We will now give a method for constructing s(x). Introduce the notation

$$M_i = s''(x_i) \qquad i = 0, 1, \dots, n \tag{3.7.14}$$

Since s(x) is cubic on $[x_i, x_{i+1}]$, $s''(x)$ is linear and thus

$$s''(x) = \frac{(x_{i+1} - x) M_i + (x - x_i) M_{i+1}}{h_i} \qquad i = 0, 1, \dots, n - 1 \tag{3.7.15}$$

where $h_i = x_{i+1} - x_i$. With this formula, $s''(x)$ is continuous on
$[x_0, x_n]$. Integrate twice to get

$$s(x) = \frac{(x_{i+1} - x)^3 M_i + (x - x_i)^3 M_{i+1}}{6h_i} + C(x_{i+1} - x) + D(x - x_i)$$

with C and D arbitrary. The interpolating condition (3.7.11) implies

$$C = \frac{y_i}{h_i} - \frac{h_i M_i}{6} \qquad D = \frac{y_{i+1}}{h_i} - \frac{h_i M_{i+1}}{6}$$

$$s(x) = \frac{(x_{i+1} - x)^3 M_i + (x - x_i)^3 M_{i+1}}{6h_i} + \frac{(x_{i+1} - x) y_i + (x - x_i) y_{i+1}}{h_i} - \frac{h_i}{6} \left[ (x_{i+1} - x) M_i + (x - x_i) M_{i+1} \right]$$
$$x_i \le x \le x_{i+1} \qquad 0 \le i \le n - 1 \tag{3.7.16}$$

This formula implies the continuity on [a, b] of s(x), as well as the
interpolating condition (3.7.11). To determine the constants $M_0, \dots, M_n$,
we require $s'(x)$ to be continuous at $x_1, \dots, x_{n-1}$:

$$\operatorname*{Limit}_{x \searrow x_i} s'(x) = \operatorname*{Limit}_{x \nearrow x_i} s'(x) \qquad i = 1, \dots, n - 1 \tag{3.7.17}$$

On $[x_i, x_{i+1}]$, differentiating (3.7.16) gives

$$s'(x) = \frac{-(x_{i+1} - x)^2 M_i + (x - x_i)^2 M_{i+1}}{2h_i} + \frac{y_{i+1} - y_i}{h_i} - \frac{(M_{i+1} - M_i) h_i}{6} \tag{3.7.18}$$

and on $[x_{i-1}, x_i]$,

$$s'(x) = \frac{-(x_i - x)^2 M_{i-1} + (x - x_{i-1})^2 M_i}{2h_{i-1}} + \frac{y_i - y_{i-1}}{h_{i-1}} - \frac{(M_i - M_{i-1}) h_{i-1}}{6}$$

Using (3.7.17) and some manipulation, we obtain

$$\frac{h_{i-1}}{6} M_{i-1} + \frac{h_i + h_{i-1}}{3} M_i + \frac{h_i}{6} M_{i+1} = \frac{y_{i+1} - y_i}{h_i} - \frac{y_i - y_{i-1}}{h_{i-1}} \tag{3.7.19}$$

for i = 1, ..., n - 1. This gives n - 1 equations for the n + 1 unknowns
$M_0, \dots, M_n$. We generally specify endpoint conditions, at $x_0$ and $x_n$,
to remove the two degrees of freedom present in (3.7.19).


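Case 1 Endpoint derivative conditions. Require that s(x) satisfy

$$s'(x_0) = y_0' \qquad s'(x_n) = y_n' \tag{3.7.20}$$

with $y_0'$, $y_n'$ given constants. Using these conditions with (3.7.18), for
i = 0 and i = n - 1, we obtain the additional equations

$$\frac{h_0}{3} M_0 + \frac{h_0}{6} M_1 = \frac{y_1 - y_0}{h_0} - y_0'$$

$$\frac{h_{n-1}}{6} M_{n-1} + \frac{h_{n-1}}{3} M_n = y_n' - \frac{y_n - y_{n-1}}{h_{n-1}}$$

Combined with (3.7.19), we have a system of linear equations

$$AM = D \tag{3.7.21}$$

with

$$D^T = \left[ \frac{y_1 - y_0}{h_0} - y_0', \; \frac{y_2 - y_1}{h_1} - \frac{y_1 - y_0}{h_0}, \; \dots, \; \frac{y_n - y_{n-1}}{h_{n-1}} - \frac{y_{n-1} - y_{n-2}}{h_{n-2}}, \; y_n' - \frac{y_n - y_{n-1}}{h_{n-1}} \right]$$

$$M^T = [M_0, M_1, \dots, M_n]$$

$$A = \begin{bmatrix}
\frac{h_0}{3} & \frac{h_0}{6} & 0 & \cdots & & 0 \\
\frac{h_0}{6} & \frac{h_0 + h_1}{3} & \frac{h_1}{6} & & & \\
0 & \frac{h_1}{6} & \frac{h_1 + h_2}{3} & \frac{h_2}{6} & & \vdots \\
\vdots & & \ddots & \ddots & \ddots & \\
 & & & \frac{h_{n-2}}{6} & \frac{h_{n-2} + h_{n-1}}{3} & \frac{h_{n-1}}{6} \\
0 & \cdots & & 0 & \frac{h_{n-1}}{6} & \frac{h_{n-1}}{3}
\end{bmatrix} \tag{3.7.22}$$

This matrix is symmetric, positive definite, and diagonally dominant, and the
linear system AM = D is uniquely solvable. This system can be solved easily and
rapidly, using about 8n arithmetic operations. (See the material on tridiagonal
systems in Section 8.3.)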

The resulting cubic spline function s(x) is sometimes called the complete cubic
spline interpolant, and we denote it by $s_c(x)$. An error analysis of it would
require too extensive a development, so we just quote results from de Boor
(1978, pp. 68-69).
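The construction of the complete cubic spline, from the system (3.7.21)-(3.7.22) and the evaluation formula (3.7.16), can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `complete_cubic_spline` is hypothetical, and for brevity the system is solved with a dense solver (`numpy.linalg.solve`) rather than the tridiagonal algorithm of Section 8.3, so it does not achieve the operation count quoted above.

```python
import numpy as np

def complete_cubic_spline(x, y, dy0, dyn):
    """Return (s, M): the spline as a function and the moments M_i = s''(x_i)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x) - 1
    h = np.diff(x)                 # h_i = x_{i+1} - x_i
    slope = np.diff(y) / h         # (y_{i+1} - y_i) / h_i

    A = np.zeros((n + 1, n + 1))
    D = np.zeros(n + 1)
    # endpoint derivative conditions (3.7.20)
    A[0, 0], A[0, 1] = h[0] / 3, h[0] / 6
    D[0] = slope[0] - dy0
    A[n, n - 1], A[n, n] = h[n - 1] / 6, h[n - 1] / 3
    D[n] = dyn - slope[n - 1]
    # interior equations (3.7.19)
    for i in range(1, n):
        A[i, i - 1] = h[i - 1] / 6
        A[i, i] = (h[i - 1] + h[i]) / 3
        A[i, i + 1] = h[i] / 6
        D[i] = slope[i] - slope[i - 1]
    M = np.linalg.solve(A, D)      # dense solve, used here only for brevity

    def s(t):
        # locate i with x_i <= t <= x_{i+1}, then apply (3.7.16)
        i = int(np.clip(np.searchsorted(x, t, side="right") - 1, 0, n - 1))
        u, v = x[i + 1] - t, t - x[i]
        return ((u**3 * M[i] + v**3 * M[i + 1]) / (6 * h[i])
                + (u * y[i] + v * y[i + 1]) / h[i]
                - h[i] * (u * M[i] + v * M[i + 1]) / 6)

    return s, M

# Example: f(x) = arctan(x) on [0, 5], with f'(x) = 1 / (1 + x^2)
xs = np.linspace(0.0, 5.0, 7)
s, M = complete_cubic_spline(xs, np.arctan(xs), 1.0, 1.0 / 26.0)
print(s(2.2), np.arctan(2.2))
```

A simple sanity check is that a cubic polynomial is reproduced exactly, since it satisfies the same interpolation and endpoint conditions and the solution of the system is unique.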


Theorem 3.4 Let f(x) be four times continuously differentiable for a ≤ x ≤ b.
Let a sequence of partitions be given,

$$\tau_n : \; a = x_0^{(n)} < x_1^{(n)} < \cdots < x_n^{(n)} = b$$

and define

$$|\tau_n| = \max_{1 \le i \le n} \left( x_i^{(n)} - x_{i-1}^{(n)} \right)$$

Let $s_{c,n}(x)$ be the complete cubic spline interpolant of f(x) on
the partition $\tau_n$:

$$s_{c,n}(x_i^{(n)}) = f(x_i^{(n)}) \qquad i = 0, 1, \dots, n$$
$$s_{c,n}'(a) = f'(a) \qquad s_{c,n}'(b) = f'(b)$$

Then for suitable constants $c_j$,

$$\max_{a \le x \le b} |f^{(j)}(x) - s_{c,n}^{(j)}(x)| \le c_j |\tau_n|^{4-j} \max_{a \le x \le b} |f^{(4)}(x)| \tag{3.7.23}$$

for j = 0, 1, 2. With the additional assumption

$$\operatorname*{Supremum}_{n} \frac{|\tau_n|}{\min_{1 \le i \le n} \left( x_i^{(n)} - x_{i-1}^{(n)} \right)} < \infty$$

the result (3.7.23) also holds for j = 3. Acceptable constants are

$$c_0 = \frac{5}{384} \qquad c_1 = \frac{1}{24} \qquad c_2 = \frac{3}{8} \tag{3.7.24}$$

Proof The proofs of most of these results can be found in de Boor (1978, pp.
68-69), along with other results on $s_{c,n}(x)$.

Letting j = 0 in (3.7.23), we see that for a uniform grid $\tau_n$, the rate of
convergence is proportional to $1/n^4$. This is the same as for piecewise cubic
Lagrange and Hermite interpolation, but the multiplying constant $c_0$ is smaller
by a factor of about 3. Thus the complete spline interpolant should be a
somewhat superior approximation, as the results in Tables 3.11-3.14 bear out.

Another motivation for using $s_c(x)$ is the following optimality property. Let
g(x) be any function that is twice continuously differentiable on [a, b], and
moreover, let it satisfy the interpolating conditions (3.7.11) and (3.7.20). Then

$$\int_a^b |s_c''(x)|^2 \, dx \le \int_a^b |g''(x)|^2 \, dx \tag{3.7.25}$$

with equality only if $g(x) \equiv s_c(x)$. Thus $s_c(x)$ "oscillates least" of
all smooth functions satisfying the interpolating conditions (3.7.11) and
(3.7.20). To prove the result, let $k(x) = s_c(x) - g(x)$, and write

$$\int_a^b |g''(x)|^2 \, dx = \int_a^b |s_c''(x) - k''(x)|^2 \, dx$$

$$= \int_a^b |s_c''(x)|^2 \, dx - 2 \int_a^b s_c''(x) k''(x) \, dx + \int_a^b |k''(x)|^2 \, dx$$

By integration by parts, and using the interpolating conditions and the
properties of $s_c(x)$, we can show

$$\int_a^b s_c''(x) k''(x) \, dx = 0 \tag{3.7.26}$$

and thus

$$\int_a^b |g''(x)|^2 \, dx = \int_a^b |s_c''(x)|^2 \, dx + \int_a^b |s_c''(x) - g''(x)|^2 \, dx \tag{3.7.27}$$

This proves (3.7.25). Equality in (3.7.25) occurs only if
$s_c''(x) - g''(x) \equiv 0$ on [a, b], or equivalently $s_c(x) - g(x)$ is
linear. The interpolating conditions then imply $s_c(x) - g(x) \equiv 0$. We
leave a further discussion of this topic to Problem 38.


Case 2 The "not-a-knot" condition. When the derivative values $f'(a)$ and
$f'(b)$ are not available, we need other end conditions on s(x) in order to
complete the system of equations (3.7.19). This is accomplished by requiring
$s^{(3)}(x)$ to be continuous at $x_1$ and $x_{n-1}$. This is equivalent to
requiring that s(x) be a cubic spline function with knots
$\{x_0, x_2, x_3, \dots, x_{n-2}, x_n\}$, while still requiring interpolation at
all node points in $\{x_0, x_1, x_2, \dots, x_{n-1}, x_n\}$. This reduces system
(3.7.19) to n - 3 equations, and the interpolation at $x_1$ and $x_{n-1}$
introduces two new equations (we leave their derivation to Problem 34). Again
we obtain a tridiagonal linear system AM = D, although the matrix A does not
possess some of the nice properties of that in (3.7.22). The resulting spline
function will be denoted here by $s_{nk}(x)$, with the subscript indicating the
"not-a-knot" condition. A convergence analysis can be given for $s_{nk}(x)$,
similar to that given in Theorem 3.4. For a discussion of this, see de Boor
(1978, p. 211), (1985).

There are other ways of introducing endpoint conditions when $f'(a)$ and
$f'(b)$ are unknown. A discussion of some of these can be found in de Boor
(1978, p. 56). However, the preceding scheme is the simplest to apply, and it is
widely used. In special cases, there are simpler endpoint conditions that can be
used than those discussed here, and we take up one of these in Problem 38. In
general, however, the preceding type of endpoint conditions are needed in order
to preserve the rates of convergence given in Theorem 3.4.


Numerical examples Let $f(x) = \tan^{-1} x$, 0 ≤ x ≤ 5. Table 3.11 gives the
errors

$$E_i = \max_{0 \le x \le 5} |f^{(i)}(x) - L_n^{(i)}(x)| \qquad i = 0, 1, 2, 3 \tag{3.7.28}$$

where $L_n(x)$ is the Lagrange piecewise cubic function interpolating f(x) on the
nodes $x_j = a + jh$, j = 0, 1, ..., n, h = (b - a)/n. The columns labeled Ratio
give the rate of decrease in the error when n is doubled. Note that the rate of
convergence for $L_n^{(i)}$ to $f^{(i)}$ is proportional to $h^{4-i}$,
i = 0, 1, 2, 3. This can be rigorously proved, and an indication of the proof is
given in Problem 32.

Table 3.11 Lagrange piecewise cubic interpolation: $L_n(x)$

n     E_0       Ratio   E_1       Ratio   E_2       Ratio   E_3     Ratio
2     1.20E-2           1.22E-1           7.81E-1           2.32
                3.3               2.1               1.5             1.2
4     3.62E-3           5.83E-2           5.24E-1           1.95
                11.4              6.1               3.2             1.6
8     3.18E-4           9.57E-3           1.64E-1           1.19
                16.9              8.1               3.9             1.7
16    1.88E-5           1.18E-3           4.21E-2           .682
                14.5              7.3               3.7             1.9
32    1.30E-6           1.61E-4           1.14E-2           .359

Table 3.12 Hermite piecewise cubic interpolation: $Q_n(x)$

n     E_0       Ratio   E_1       Ratio   E_2       Ratio   E_3     Ratio
3     2.64E-2           5.18E-2           4.92E-1           2.06
                5.6               3.0               2.3             1.5
6     4.73E-3           1.74E-2           2.14E-1           1.33
                16.0              8.0               3.6             1.5
12    2.95E-4           2.17E-3           5.91E-2           .891
                13.1              6.7               3.6             1.9
24    2.26E-5           3.25E-4           1.66E-2           .475
                16.0              8.0               4.0             2.0
48    1.41E-6           4.06E-5           4.18E-3           .241

Table 3.12 gives the analogous errors for the Hermite piecewise cubic function
$Q_n(x)$ interpolating f(x). Note that again the errors agree with

$$\max_{a \le x \le b} |f^{(i)}(x) - Q_n^{(i)}(x)| \le c h^{4-i} \qquad i = 0, 1, 2, 3$$

which can also be proved, for some c > 0.

As was stated earlier following (3.7.10), the functions $L_n(x)$ and $Q_m(x)$,
m = 1.5n, are of comparable accuracy in approximating f(x), and Tables 3.11
and 3.12 confirm this. In contrast, the derivative $Q_m'(x)$ is a more accurate
approximation to $f'(x)$ than is $L_n'(x)$. An explanation is given in Problem
32.

In Table 3.13, we give the results of using the complete cubic spline
interpolant $s_c(x)$. To compare with $L_n(x)$ and $Q_m(x)$ for comparable
amounts of given data on f(x), we use the same number of evenly spaced
interpolation points as used in $L_n(x)$.


Table 3.13 Complete cubic spline interpolation: $s_c(x)$

n     E_0       Ratio   E_1       Ratio   E_2       Ratio   E_3       Ratio
6     7.09E-3           2.45E-2           1.40E-1           1.06E0
                21.9              10.7              4.8               2.6
12    3.24E-4           2.28E-3           2.90E-2           4.09E-1
                10.6              5.6               2.9               1.6
24    3.06E-5           4.09E-4           9.84E-3           2.53E-1
                20.7              9.7               4.6               2.1
48    1.48E-6           4.22E-5           2.13E-3           1.22E-1
                16.4              8.1               4.0               2.0
96    9.04E-8           5.19E-6           5.30E-4           6.09E-2

Example Another informative example is to take $f(x) = x^4$, 0 ≤ x ≤ 1. All of
the preceding interpolation formulas have $f^{(4)}(x)$ as a multiplier in their
error formulas. Since $f^{(4)}(x) = 24$, a constant, the error for all three
forms of interpolation satisfies

$$\max_{0 \le x \le 1} |f^{(i)}(x) - f_n^{(i)}(x)| = c_i h^{4-i} \qquad i = 0, 1, 2, 3 \tag{3.7.29}$$

in which $f_n(x)$ is the interpolating function. The constants $c_i$ will vary
with the form of interpolation being used. In the actual computations, the
errors behaved exactly like (3.7.29), thus providing another means for comparing
the methods. We give the results in Table 3.14 for only the most accurate case.

Table 3.14 Comparison of three forms of piecewise cubic interpolation

Method              E_0         E_1        E_2        E_3
Lagrange (n = 32)   1.10E-8     6.78E-6    2.93E-3    .375
Hermite (n = 48)    1.18E-8     1.74E-6    8.68E-4    .250
Spline (n = 96)     7.36E-10    2.12E-7    1.09E-4    .0625


These examples show that the complete cubic interpolating spline is more 
accurate, significantly so in some cases. But the examples also show that all of the 
methods are probably adequate in terms of accuracy, and that they all converge 
at the same rate. Therefore, the decision as to which method of interpolation to 
use should depend on other factors, usually arising from the intended area of 
application. Spline functions have proved very useful with data fitting problems 
and curve fitting, and Lagrange and Hermite functions are more useful for 
analytic approximations in solving integral and differential equations, respec- 
tively. All of these forms of piecewise polynomial approximation are useful with 
all of these applications, and one should choose the form of approximation based 
on the needs of the problem being solved. 


B-splines One way of representing cubic spline functions is given in
(3.7.12)-(3.7.13), in which a cubic polynomial is given on each subinterval. This
is satisfactory for interpolation problems, as given in (3.7.16), but for most
applications, there are better ways to represent cubic spline functions. As
before, we look at cubic splines with knots $\{x_0, x_1, \dots, x_n\}$.

Define

$$x_+^r = \begin{cases} 0 & x < 0 \\ x^r & x \ge 0 \end{cases} \tag{3.7.30}$$

This is a spline of order r + 1, and it has only the one knot x = 0. This can be
used to give a second representation of spline functions. Let s(x) be a spline
function of order m with knots $\{x_0, \dots, x_n\}$. Then for
$x_0 \le x \le x_n$,

$$s(x) = p_{m-1}(x) + \sum_{j=1}^{n-1} \beta_j (x - x_j)_+^{m-1} \tag{3.7.31}$$

with $p_{m-1}(x)$ a uniquely chosen polynomial of degree ≤ m - 1 and
$\beta_1, \dots, \beta_{n-1}$ uniquely determined coefficients. The proof of
this result is left as Problem 37. There are several unsatisfactory features to
this representation when applying it to the solution of other problems. The most
serious problem is that it often leads to numerical schemes that are
ill-conditioned. For this reason, we introduce another numerical representation
of s(x), one that is much better in its numerical properties. To simplify the
presentation, we consider only cubic splines.

We begin by augmenting the knots $\{x_0, \dots, x_n\}$. Choose additional knots

$$x_{-3} < x_{-2} < x_{-1} < x_0 \qquad x_n < x_{n+1} < x_{n+2} < x_{n+3} \tag{3.7.32}$$

in some arbitrary manner. For i = -3, -2, ..., n - 1, define

$$B_i(x) = (x_{i+4} - x_i) f_x[x_i, x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4}] \tag{3.7.33}$$

a fourth-order divided difference of

$$f_x(t) = (t - x)_+^3 \tag{3.7.34}$$

The function $B_i(x)$ is called a B-spline, which is short for basic spline
function. As an alternative to (3.7.33), apply the formula (3.2.5) for divided
differences, obtaining

$$B_i(x) = (x_{i+4} - x_i) \sum_{j=i}^{i+4} \frac{(x_j - x)_+^3}{\Psi_i'(x_j)}$$

$$\Psi_i(x) = (x - x_i)(x - x_{i+1})(x - x_{i+2})(x - x_{i+3})(x - x_{i+4}) \tag{3.7.35}$$

This shows $B_i(x)$ is a cubic spline with knots $x_i, \dots, x_{i+4}$. A graph
of a typical B-spline is shown in Figure 3.7. We summarize some important
properties of B-splines as follows.


Theorem 3.5 The cubic B-splines satisfy

(a) $B_i(x) = 0$ outside of $x_i < x < x_{i+4}$; (3.7.36)

(b) $0 \le B_i(x) \le 1$ all x; (3.7.37)

(c) $\sum_{i=-3}^{n-1} B_i(x) = 1$, $x_0 \le x \le x_n$; (3.7.38)

(d) $\int_{x_i}^{x_{i+4}} B_i(x) \, dx = \dfrac{x_{i+4} - x_i}{4}$; (3.7.39)

(e) If s(x) is a cubic spline function with knots $\{x_0, \dots, x_n\}$,
then for $x_0 \le x \le x_n$,

$$s(x) = \sum_{i=-3}^{n-1} \alpha_i B_i(x) \tag{3.7.40}$$

with the choice of $\alpha_{-3}, \dots, \alpha_{n-1}$ unique.

Figure 3.7 The B-spline $B_0(x)$.

Proof (a) For $x \le x_i$, the function $f_x(t)$ is a cubic polynomial for the
interval $x_i \le t \le x_{i+4}$. Thus its fourth-order divided difference is
zero. For $x \ge x_{i+4}$, the function $f_x(t) \equiv 0$ for
$x_i \le t \le x_{i+4}$, and thus $B_i(x) = 0$.

(b) See de Boor (1978, p. 131).

(c) Using the recursion relation for divided differences,

$$B_i(x) = f_x[x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4}] - f_x[x_i, x_{i+1}, x_{i+2}, x_{i+3}] \tag{3.7.41}$$

Next, assume $x_k \le x \le x_{k+1}$. Then the only B-splines that can be
nonzero at x are $B_{k-3}(x), B_{k-2}(x), \dots, B_k(x)$. Using (3.7.41),

$$\sum_{i=-3}^{n-1} B_i(x) = \sum_{i=k-3}^{k} B_i(x)$$

$$= \sum_{i=k-3}^{k} \left( f_x[x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4}] - f_x[x_i, x_{i+1}, x_{i+2}, x_{i+3}] \right)$$

$$= f_x[x_{k+1}, x_{k+2}, x_{k+3}, x_{k+4}] - f_x[x_{k-3}, x_{k-2}, x_{k-1}, x_k] = 1 - 0 = 1$$

The last step uses (1) the fact that $f_x(t)$ is cubic on
$[x_{k+1}, x_{k+4}]$, so that the divided difference equals 1, from (3.2.18);
and (2) $f_x(t) \equiv 0$ on $[x_{k-3}, x_k]$.

(d) See de Boor (1978, p. 151).

(e) The concept of B-splines originated with I. J. Schoenberg, and the
result (3.7.40) is due to him. For a proof, see de Boor (1978, p. 113).
Because of (3.7.36), the sum in (3.7.40) involves at most four nonzero terms.
For $x_k \le x \le x_{k+1}$,

$$s(x) = \sum_{i=k-3}^{k} \alpha_i B_i(x) \tag{3.7.42}$$

In addition, using (3.7.37) and (3.7.38),

$$\min \{\alpha_{k-3}, \alpha_{k-2}, \alpha_{k-1}, \alpha_k\} \le s(x) \le \max \{\alpha_{k-3}, \alpha_{k-2}, \alpha_{k-1}, \alpha_k\}$$

showing that the value of s(x) is bounded by coefficients for B-splines near to
x. In this sense, (3.7.40) is a local representation of s(x), at each
$x \in [x_0, x_n]$.
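Formula (3.7.35) can be evaluated directly, which gives a simple way to observe properties (a) and (c) of Theorem 3.5. The following is a minimal sketch in plain Python; the function names `bspline` and `sum_of_bsplines` and the choice of uniform integer knots $x_j = j$ are assumptions made for illustration. Direct use of (3.7.35) is convenient for a small experiment, but it is not necessarily the numerically preferred way to compute B-splines.

```python
def bspline(tk, x):
    """B_i(x) from (3.7.35), for the five knots tk = [x_i, ..., x_{i+4}]."""
    total = 0.0
    for j in range(5):
        dpsi = 1.0                      # Psi_i'(x_j) = prod_{m != j} (x_j - x_m)
        for m in range(5):
            if m != j:
                dpsi *= tk[j] - tk[m]
        total += max(tk[j] - x, 0.0) ** 3 / dpsi   # (x_j - x)_+^3 / Psi_i'(x_j)
    return (tk[4] - tk[0]) * total

def sum_of_bsplines(x, n):
    """Sum of B_i(x), i = -3, ..., n-1, for the uniform knots x_j = j."""
    return sum(bspline([float(i + m) for m in range(5)], x)
               for i in range(-3, n))

n = 5
for x in (0.0, 0.3, 2.5, 4.9, 5.0):
    print(x, sum_of_bsplines(x, n))                  # property (c)
print(bspline([0.0, 1.0, 2.0, 3.0, 4.0], 2.0))       # peak of B_0
```

With these uniform knots the sum stays at 1 (up to rounding) throughout $[x_0, x_n]$, and each $B_i$ vanishes (up to rounding) outside its four subintervals.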

A more general treatment of B-splines is given in de Boor (1978, chaps. 9-11), 
along with further properties omitted here. Programs are also given for comput- 
ing with B-splines. 

An important generalization of splines arises when the knots are allowed to
coalesce. In particular, let some of the nodes in (3.7.33) become coincident.
Then, as long as $x_i < x_{i+4}$, the function $B_i(x)$ will be a cubic
piecewise polynomial. Letting two knots coalesce will reduce from two to one the
number of continuous derivatives at the multiple knot. Letting three knots
coalesce will mean that $B_i(x)$ will only be continuous. Doing this, (3.7.40)
becomes a representation for all cubic piecewise polynomials. In this scheme of
things, all piecewise polynomial functions are spline functions, and vice versa.
This is fully explored in de Boor (1978).


3.8 Trigonometric Interpolation 


An extremely important class of functions are the periodic functions. A function 
f(t) is said to be periodic with period $\tau$ if

$$f(t + \tau) = f(t) \qquad -\infty < t < \infty$$

and this is not to be true for any smaller positive value of $\tau$. The best known
periodic functions are the trigonometric functions. Periodic functions occur 
widely in applications, and this motivates our consideration of interpolation 
suitable for data derived from such functions. In addition, we use this topic to 
introduce the fast Fourier transform (FFT), which is used in solving many 
problems that involve data from periodic functions. 

By suitably scaling the independent variable, it is always possible to let 
$\tau = 2\pi$ be the period:

$$f(t + 2\pi) = f(t) \qquad -\infty < t < \infty \qquad (3.8.1)$$


We approximate such functions f(t) by using trigonometric polynomials, 


$$p_n(t) = a_0 + \sum_{j=1}^{n} \left[ a_j \cos(jt) + b_j \sin(jt) \right] \qquad (3.8.2)$$

If $|a_n| + |b_n| \ne 0$, then this function $p_n(t)$ is called a trigonometric polynomial of
degree n. It can be shown by using trigonometric addition formulas that an
equivalent formulation is

$$p_n(t) = \alpha_0 + \sum_{j=1}^{n} \left\{ \alpha_j [\cos(t)]^j + \beta_j [\sin(t)]^j \right\} \qquad (3.8.3)$$

thus partially explaining our use of the word polynomial for such a function. The
polynomial $p_n(t)$ has period $2\pi$ or integral fraction thereof.

To study interpolation problems with $p_n(t)$ as a solution, we must impose
2n + 1 interpolating conditions, since $p_n(t)$ contains 2n + 1 coefficients $a_j$, $b_j$.
Because of the periodicity of the function f(t) and the polynomial $p_n(t)$, we also
require the interpolation nodes to lie in the interval $0 \le t < 2\pi$ (or equivalently,
$-\pi \le t < \pi$ or $0 < t \le 2\pi$). Thus we assume the existence of the interpolation
nodes

$$0 \le t_0 < t_1 < \cdots < t_{2n} < 2\pi \qquad (3.8.4)$$

and we require $p_n(t)$ to be chosen to satisfy

$$p_n(t_i) = f(t_i) \qquad i = 0, 1, \ldots, 2n \qquad (3.8.5)$$


It is shown later that this problem has a unique solution. 

This interpolation problem has an explicit solution, comparable to the Lagrange 
formula (3.1.6) for polynomial interpolation; this is dealt with in Problem 41.
Rather than proceeding with such a development, we first convert (3.8.4)-(3.8.5) 
to an equivalent problem involving polynomials and functions of a complex 
variable. This new formulation is the more natural mathematical setting for 
trigonometric polynomial interpolation. 

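Using Euler’s formula

$$e^{i\theta} = \cos(\theta) + i \sin(\theta) \qquad i = \sqrt{-1} \qquad (3.8.6)$$

we obtain

$$\cos(\theta) = \frac{e^{i\theta} + e^{-i\theta}}{2} \qquad \sin(\theta) = \frac{e^{i\theta} - e^{-i\theta}}{2i} \qquad (3.8.7)$$

Using these in (3.8.2), we obtain

$$p_n(t) = \sum_{j=-n}^{n} c_j e^{ijt} \qquad (3.8.8)$$

The coefficients are related by

$$c_0 = a_0 \qquad c_j = \tfrac{1}{2}(a_j - i b_j) \qquad c_{-j} = \tfrac{1}{2}(a_j + i b_j) \qquad 1 \le j \le n$$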


Given $\{c_j\}$, the coefficients $\{a_j, b_j\}$ are easily obtained by solving these latter
equations. Letting $z = e^{it}$, we can rewrite (3.8.8) as the complex function

$$P_n(z) = \sum_{j=-n}^{n} c_j z^j \qquad (3.8.9)$$

The function $z^n P_n(z)$ is a polynomial of degree $\le 2n$.

To reformulate the interpolation problem (3.8.4)-(3.8.5), let $z_j = e^{it_j}$,
j = 0,...,2n. With the restriction in (3.8.4), the numbers $z_j$ are distinct points on
the unit circle |z| = 1 in the complex plane. The interpolation problem is

$$P_n(z_j) = f(t_j) \qquad j = 0, \ldots, 2n \qquad (3.8.10)$$

To see that this is always uniquely solvable, note that it is equivalent to

$$Q(z_j) = z_j^n f(t_j) \qquad j = 0, \ldots, 2n$$

with $Q(z) = z^n P_n(z)$. This is a polynomial interpolation problem, with 2n + 1
distinct node points $z_0, \ldots, z_{2n}$; Theorem 3.1 shows there is a unique solution.
Also, the Lagrange formula (3.1.6) generalizes to Q(z), and thence to $P_n(z)$.

There are a number of reasons, both theoretical and practical, for converting 
to the complex variable form of trigonometric interpolation. The most important 
in our view is that interpolation and approximation by trigonometric polynomials 
are intimately connected to the subject of differentiable functions of a complex 
variable, and much of the theory is better understood from this perspective. We 
do not develop this theory, but a complete treatment is given in Henrici (1986, 
chap. 13) and Zygmund (1959, chap. 10).


Evenly spaced interpolation The case of interpolation that is of most interest in
applications is to use evenly spaced nodes $t_j$. More precisely, define

$$t_j = j \frac{2\pi}{2n+1} \qquad j = 0, \pm 1, \pm 2, \ldots \qquad (3.8.11)$$

The points $t_0, \ldots, t_{2n}$ satisfy (3.8.4), and the points $z_j = e^{it_j}$, j = 0,...,2n, are
evenly spaced points on the unit circle |z| = 1. Note also that the points $z_j$
repeat as j increases by 2n + 1.

We now develop an alternative to the Lagrange form for $p_n(t)$ when the nodes
$\{t_j\}$ satisfy (3.8.11). We begin with the following lemma.


Lemma 4 For all integers k,

$$\sum_{j=0}^{2n} e^{ikt_j} = \begin{cases} 2n+1 & e^{it_k} = 1 \\ 0 & e^{it_k} \ne 1 \end{cases} \qquad (3.8.12)$$

The condition $e^{it_k} = 1$ is equivalent to k being an integer multiple of
2n + 1.

Proof Let $z = e^{it_k}$. Then using (3.8.11), $e^{ikt_j} = e^{ijt_k} = z^j$, and the sum in (3.8.12)
becomes

$$S = \sum_{j=0}^{2n} z^j$$

If z = 1, then this sums to 2n + 1. If $z \ne 1$, then the geometric series
formula (1.1.8) implies

$$S = \frac{z^{2n+1} - 1}{z - 1}$$

Using (3.8.11), $z^{2n+1} = e^{2\pi i k} = 1$; thus, S = 0. ■
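
Lemma 4 is easy to confirm numerically. The short Python sketch below is an illustration, with the function names `node` and `lemma4_sum` and the choice n = 3 being assumptions for this example. It evaluates the left side of (3.8.12) for a range of integers k and compares it with 2n + 1 or 0.

```python
import cmath
import math


def node(j, n):
    """Evenly spaced node t_j = 2*pi*j/(2n+1), as in (3.8.11)."""
    return 2.0 * math.pi * j / (2 * n + 1)


def lemma4_sum(k, n):
    """Left side of (3.8.12): the sum over j = 0..2n of exp(i*k*t_j)."""
    return sum(cmath.exp(1j * k * node(j, n)) for j in range(2 * n + 1))


n = 3                                   # assumed example, so 2n + 1 = 7 nodes
for k in range(-8, 9):
    expected = 2 * n + 1 if k % (2 * n + 1) == 0 else 0
    print(k, expected, abs(lemma4_sum(k, n) - expected) < 1e-12)
```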


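The interpolation conditions (3.8.10) can be written as

$$\sum_{k=-n}^{n} c_k e^{ikt_j} = f(t_j) \qquad j = 0, 1, \ldots, 2n \qquad (3.8.13)$$

To find the coefficients $c_k$, we use Lemma 4. Multiply equation j by $e^{-ilt_j}$, then
sum over j, restricting l to satisfy $-n \le l \le n$. This yields

$$\sum_{j=0}^{2n} \sum_{k=-n}^{n} c_k e^{i(k-l)t_j} = \sum_{j=0}^{2n} e^{-ilt_j} f(t_j) \qquad (3.8.14)$$

Reverse the order of summation, and then use Lemma 4 to obtain

$$\sum_{j=0}^{2n} e^{i(k-l)t_j} = \begin{cases} 0 & k \ne l \\ 2n+1 & k = l \end{cases}$$

Using this in (3.8.14), we obtain

$$c_l = \frac{1}{2n+1} \sum_{j=0}^{2n} e^{-ilt_j} f(t_j) \qquad l = -n, \ldots, n \qquad (3.8.15)$$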


The coefficients $\{c_{-n}, \ldots, c_n\}$ are called the finite Fourier transform of the data
$\{f(t_0), \ldots, f(t_{2n})\}$. They yield an explicit formula for the trigonometric
interpolating polynomial $p_n(t)$ of (3.8.8). The formula (3.8.15) is related to the
Fourier coefficients of f(t):

$$\hat{c}_l = \frac{1}{2\pi} \int_0^{2\pi} e^{-ilt} f(t)\,dt \qquad -\infty < l < \infty \qquad (3.8.16)$$

If the trapezoidal numerical integration rule [see Section 5.1] is applied to these
integrals, using 2n + 1 subdivisions of $[0, 2\pi]$, then (3.8.15) is the result, provided
f(t) is periodic on $[0, 2\pi]$. We next discuss the convergence of $p_n(t)$ to f(t).
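
The formulas (3.8.8), (3.8.11), and (3.8.15) translate directly into a few lines of code. The Python sketch below is a minimal illustration, not an efficient implementation: the function names and the test function $e^{\sin t}$ are assumptions chosen for this example. It computes the coefficients $c_l$, evaluates $p_n(t)$, and recovers the real coefficients of (3.8.2) from $a_j = c_j + c_{-j}$ and $b_j = i(c_j - c_{-j})$.

```python
import cmath
import math


def finite_fourier_transform(f, n):
    """Coefficients c_l, l = -n..n, from (3.8.15) with the nodes (3.8.11)."""
    m = 2 * n + 1
    nodes = [2.0 * math.pi * j / m for j in range(m)]
    return {l: sum(cmath.exp(-1j * l * t) * f(t) for t in nodes) / m
            for l in range(-n, n + 1)}


def trig_interpolant(c, t):
    """Evaluate p_n(t) = sum over l of c_l * exp(i*l*t), as in (3.8.8)."""
    return sum(cl * cmath.exp(1j * l * t) for l, cl in c.items())


def real_coefficients(c, n):
    """Recover a_j and b_j of (3.8.2) from the complex coefficients c_j."""
    a = [c[0].real] + [(c[j] + c[-j]).real for j in range(1, n + 1)]
    b = [0.0] + [(1j * (c[j] - c[-j])).real for j in range(1, n + 1)]
    return a, b


def f(t):
    """Assumed smooth test function with period 2*pi."""
    return math.exp(math.sin(t))


n = 8
c = finite_fourier_transform(f, n)
grid = [2.0 * math.pi * i / 400 for i in range(400)]
max_error = max(abs(f(t) - trig_interpolant(c, t).real) for t in grid)
print("n =", n, "maximum error on the sample grid:", max_error)
```

Each coefficient costs 2n + 1 multiplications here, so the whole transform costs on the order of $(2n+1)^2$ operations.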
Theorem 3.6 Let f(t) be a continuous, periodic function, and let $2\pi$ be an
integer multiple of its period. Define

$$\rho_n(f) = \underset{\deg(q) \le n}{\operatorname{Infimum}} \left[ \underset{0 \le t \le 2\pi}{\operatorname{Max}} |f(t) - q(t)| \right] \qquad (3.8.17)$$

with q(t) a trigonometric polynomial. Then the interpolating
function $p_n(t)$ from (3.8.8) and (3.8.15) satisfies

$$\underset{0 \le t \le 2\pi}{\operatorname{Max}} |f(t) - p_n(t)| \le c [\ln(n + 2)] \rho_n(f) \qquad n \ge 0 \qquad (3.8.18)$$

The constant c is independent of f and n.


Proof See Zygmund (1959, chap. 10, p. 19), since the proof is fairly
complicated. ■


The quantity $\rho_n(f)$ is called a minimax error (see Chapter 4), and it can be
estimated in a variety of ways. The most important bound on $\rho_n(f)$ is probably
that of D. Jackson. Assume that f(t) is k times continuously differentiable on
$[0, 2\pi]$, $k \ge 0$, and further assume $f^{(k)}(t)$ satisfies the condition

$$|f^{(k)}(t_1) - f^{(k)}(t_2)| \le c\,|t_1 - t_2|^{\alpha} \qquad 0 \le t_1, t_2 \le 2\pi$$

for some $0 < \alpha \le 1$. (This is called a Hölder condition.) Then

$$\rho_n(f) \le \frac{c_k(f)}{n^{k+\alpha}} \qquad n \ge 1 \qquad (3.8.19)$$

with $c_k(f)$ independent of n. For a proof, see Meinardus (1967, p. 55).
An alternative error formula to that of (3.8.18) is given in Henrici (1986, cor. 
13.6c), using the Fourier series coefficients (3.8.16) for f(t). 


Example Consider approximating f(t) = e"“, using the interpolating func- 
tion p,(t). The maximum error 


$$E_n = \underset{0 \le t \le 2\pi}{\operatorname{Max}} |f(t) - p_n(t)|$$


for various values of n, is given in Table 3.15. The convergence is rapid. 


Table 3.15 Error in trigonometric 
polynomial interpolation 
The fast Fourier transform The approximation of f(t) by $p_n(t)$ in the preceding
example was very accurate for small values of n. In contrast, the calculation of
the finite Fourier transform (3.8.15) in other applications will often require large
values of n. We introduce a method that is very useful in reducing the cost of
calculating the coefficients $\{c_l\}$ when n is large.

Rather than using formula (3.8.15), we consider the equivalent formula

$$d_k = \frac{1}{m} \sum_{j=0}^{m-1} w_m^{jk} f_j \qquad w_m = e^{-2\pi i/m} \qquad k = 0, 1, \ldots, m-1 \qquad (3.8.20)$$

with given data $\{f_0, \ldots, f_{m-1}\}$. This is called a finite Fourier transform of order
m. For formula (3.8.15), let m = 2n + 1. We can allow k to be any integer,
noting that

$$d_{k+m} = d_k \qquad -\infty < k < \infty \qquad (3.8.21)$$

Thus it is sufficient to compute $d_0, \ldots, d_{m-1}$ or any other m consecutive
coefficients $d_k$.

To contrast the formula (3.8.20) with the alternative presented below, we
calculate the cost of evaluating $d_0, \ldots, d_{m-1}$ using (3.8.20). To evaluate $d_k$, let
$z_k = w_m^k$. Then

$$d_k = \frac{1}{m} \sum_{j=0}^{m-1} f_j z_k^j \qquad (3.8.22)$$

Using nested multiplication, this requires m - 1 multiplications and m - 1
additions. We ignore the division by m, since often other factors are used. The
evaluation of $z_k$ requires only 1 multiplication, since $z_k = w_m z_{k-1}$, $k \ge 2$. The
total cost of evaluating $d_0, \ldots, d_{m-1}$ is $m^2$ multiplications and m(m - 1)
additions.
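
The operation count above can be made concrete. The following Python sketch is an illustration under stated assumptions: the name `direct_transform` and the multiplication counter are inventions for this example. It evaluates (3.8.22) by nested multiplication and counts one multiplication for each update of $z_k$, which gives $m^2$ multiplications in total.

```python
import cmath
import math


def direct_transform(f):
    """Evaluate d_0..d_{m-1} of (3.8.20) by nested multiplication (3.8.22).

    Returns the list of d_k and the number of complex multiplications used.
    """
    m = len(f)
    w = cmath.exp(-2j * math.pi / m)
    d, mults = [], 0
    z = 1.0 + 0.0j                        # z_0 = w^0
    for k in range(m):
        acc = f[m - 1]
        for j in range(m - 2, -1, -1):    # nested multiplication: m - 1 products
            acc = acc * z + f[j]
            mults += 1
        d.append(acc / m)                 # division by m is not counted
        z *= w                            # next power of w
        mults += 1
    return d, mults


data = [1.0, 2.0, 0.5, -1.0, 3.0, 0.25]   # arbitrary sample data
d, mults = direct_transform(data)
print(len(data), "points used", mults, "multiplications")
```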

To introduce the main idea behind the fast Fourier transform, let m = pq with 
p and q positive integers greater than 1. Rewrite the definition (3.8.20) in the 
equivalent form 


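$$d_k = \frac{1}{p} \sum_{l=0}^{p-1} \frac{1}{q} \sum_{g=0}^{q-1} w_m^{k(l+pg)} f_{l+pg}$$

Use $w_m^p = \exp(-2\pi i/q) = w_q$. Then

$$d_k = \frac{1}{p} \sum_{l=0}^{p-1} w_m^{kl} \left[ \frac{1}{q} \sum_{g=0}^{q-1} w_q^{kg} f_{l+pg} \right] \qquad k = 0, 1, \ldots, m-1$$

Write

$$e_k^{(l)} = \frac{1}{q} \sum_{g=0}^{q-1} w_q^{kg} f_{l+pg} \qquad 0 \le l \le p-1 \qquad (3.8.23)$$

$$d_k = \frac{1}{p} \sum_{l=0}^{p-1} w_m^{kl} e_k^{(l)} \qquad 0 \le k \le m-1 \qquad (3.8.24)$$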



Once $\{e_k^{(l)}\}$ is known, each value of $d_k$ will require p multiplications, using a
nested multiplication scheme as in (3.8.22). The evaluation of (3.8.24) will require
mp multiplications, assuming all $e_k^{(l)}$ have been computed previously. There will
be a comparable number of additions.

We turn our attention to the computation of the quantities $e_k^{(l)}$. The index k
ranges from 0 to m - 1, but not all of these need be computed. Note that

$$e_{k+q}^{(l)} = \frac{1}{q} \sum_{g=0}^{q-1} w_q^{(k+q)g} f_{l+pg} = \frac{1}{q} \sum_{g=0}^{q-1} w_q^{kg} f_{l+pg} = e_k^{(l)}$$

because $w_q^q = 1$. Thus $e_k^{(l)}$ needs to be calculated for only k = 0, 1, ..., q - 1,
and then it repeats itself. For each l, $\{e_0^{(l)}, \ldots, e_{q-1}^{(l)}\}$ is the finite Fourier
transform of the data $\{f_l, f_{l+p}, \ldots, f_{l+p(q-1)}\}$, $0 \le l \le p-1$. Thus the computation
of $\{e_k^{(l)}\}$ amounts to the computation of p finite Fourier transforms of order
q (i.e., for data of length q).
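
The splitting in (3.8.23)-(3.8.24) can be sketched in Python as follows. This is a hedged illustration: the names `transform` and `split_transform` are assumptions for this example, and the inner transforms of order q are evaluated from the definition rather than recursively. It computes the p transforms of order q on the subsequences $f_l, f_{l+p}, \ldots$ and then combines them, using the periodicity $e_{k+q}^{(l)} = e_k^{(l)}$.

```python
import cmath
import math


def transform(f):
    """Finite Fourier transform (3.8.20), evaluated directly from the definition."""
    m = len(f)
    return [sum(cmath.exp(-2j * math.pi * j * k / m) * f[j] for j in range(m)) / m
            for k in range(m)]


def split_transform(f, p, q):
    """Compute d_0..d_{m-1} through (3.8.23)-(3.8.24) for m = p*q."""
    m = p * q
    if len(f) != m:
        raise ValueError("len(f) must equal p*q")
    # e[l][k], k = 0..q-1: transform of the subsequence f_l, f_{l+p}, ..., f_{l+p(q-1)}
    e = [transform(f[l::p]) for l in range(p)]
    d = []
    for k in range(m):
        total = sum(cmath.exp(-2j * math.pi * k * l / m) * e[l][k % q]
                    for l in range(p))
        d.append(total / p)
    return d


data = [float((7 * j) % 5) for j in range(12)]   # arbitrary sample data, m = 12
same = all(abs(x - y) < 1e-12
           for x, y in zip(split_transform(data, 3, 4), transform(data)))
print("split form agrees with the definition:", same)
```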

The fast Fourier transform amounts to a repeated use of this idea, to reduce 
the computation to finite Fourier transforms of smaller and smaller order. To be 
more specific, suppose $m = 2^r$ for some integer r. As a first step, let p = 2,
$q = 2^{r-1}$. Then the computation of (3.8.24) requires 2m multiplications plus the
cost of evaluating two finite Fourier transforms of order $q = 2^{r-1}$. For each of
these, repeat the process recursively. There will be r levels in this process,
resulting eventually in the evaluation of finite Fourier transforms of order one.
The total number of multiplications is given by

$$2m + 2\left[ 2\left(\frac{m}{2}\right) + 2\left[ 2\left(\frac{m}{4}\right) + \cdots + 2\left[ 2\left(\frac{m}{2^{r-1}}\right) \right] \cdots \right] \right]$$

which sums to

$$2rm = 2m \log_2 m \qquad (3.8.25)$$

Thus the number of operations is proportional to $m \log_2 m$, in contrast to the
$m^2$ operations of the nested multiplication algorithm of (3.8.22). When m is
large, say $2^{10}$ or larger, this results in an enormous savings in calculation time.
For the particular case of $m = 2^r$, a more careful accounting will show that only
$m \log_2 m$ multiplications are actually needed, and there is a variant procedure
that requires only half of this.
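
For $m = 2^r$ the recursion with p = 2 can be written compactly. The Python sketch below is one common radix-2 arrangement, offered as an illustration rather than as the specific procedure of any cited reference; the names `fft_radix2` and `transform` are assumptions for this example. It keeps the 1/m normalization of (3.8.20) and uses the identity $w_m^{k+q} = -w_m^k$ for q = m/2, so each product $w_m^k e_k^{(1)}$ is formed once and reused for $d_k$ and $d_{k+q}$.

```python
import cmath
import math


def fft_radix2(f):
    """Recursive evaluation of (3.8.20) when len(f) is a power of two."""
    m = len(f)
    if m == 1:
        return [complex(f[0])]
    if m % 2 != 0:
        raise ValueError("length must be a power of two")
    q = m // 2
    e0 = fft_radix2(f[0::2])      # transform of f_0, f_2, ...  (l = 0)
    e1 = fft_radix2(f[1::2])      # transform of f_1, f_3, ...  (l = 1)
    d = [0j] * m
    for k in range(q):
        t = cmath.exp(-2j * math.pi * k / m) * e1[k]   # w_m^k * e_k^(1)
        d[k] = (e0[k] + t) / 2                          # (3.8.24) with p = 2
        d[k + q] = (e0[k] - t) / 2                      # since w_m^(k+q) = -w_m^k
    return d


def transform(f):
    """Reference: (3.8.20) evaluated directly from the definition."""
    m = len(f)
    return [sum(cmath.exp(-2j * math.pi * j * k / m) * f[j] for j in range(m)) / m
            for k in range(m)]


data = [math.sin(0.7 * j) + 0.1 * j for j in range(16)]   # arbitrary sample data
same = all(abs(x - y) < 1e-12 for x, y in zip(fft_radix2(data), transform(data)))
print("recursive result agrees with the definition:", same)
```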

For other values of m, there are modifications of the preceding procedure that
will still yield an operations count proportional to $m \log m$, just as previously
shown, but the case of $m = 2^r$ leads to the greatest savings. For a further
discussion of this topic, we refer the reader to Henrici (1986, chap. 13). Henrici 
also contains a discussion of the stability of the algorithm when rounding errors 
are taken into account. The use of the fast Fourier transform has revolutionized 
many subjects, making computationally feasible many calculations that previ- 
ously were not practical. 
Discussion of the Literature


As noted in the introduction, interpolation theory is a foundation for the 
development of methods in numerical integration and differentiation, approximation
theory, and the numerical solution of differential equations. Each of these
topics is developed in the following chapters, and the associated literature is 
discussed at that point. Additional results on interpolation theory are given in 
de Boor (1978), Davis (1963), Henrici (1982, chaps. 5 and 7), and Hildebrand 
(1956). For an historical account of many of the topics of this chapter, see 
Goldstine (1977). 

The introduction of digital computers produced a revolution in numerical 
analysis, including interpolation theory. Before the use of digital computers, hand 
computation was necessary, which meant that numerical methods were used that 
minimized the need for computation. Such methods were often more complicated 
than the methods now used on computers, taking special advantage of the unique 
mathematical characteristics of each problem. These methods also made exten- 
sive use of tables, to avoid repeating calculations done by others; interpolation 
formulas based on finite differences were used extensively. A large subject was 
created, called the finite difference calculus, and it was used in solving problems 
in several areas of numerical analysis and applied mathematics. For a general 
introduction to this approach to numerical analysis, see Hildebrand (1956) and 
the references contained therein. 

The use of digital computers has changed the needs of other areas for 
interpolation theory, vastly reducing the need for finite difference based interpo- 
lation formulas. But there is still an important place for both hand computation 
and the use of mathematical tables, especially for the more complicated functions 
of mathematical physics. Everyone doing numerical work should possess an 
elementary book of tables such as the well-known CRC tables. The National 
Bureau of Standards tables of Abramowitz and Stegun (1964) are an excellent 
reference for nonelementary functions. The availability of sophisticated hand 
calculators and microcomputers makes possible a new level of hand (or personal) 
calculation.

Piecewise polynomial approximation theory has been very popular since the 
early 1960s, and it is finding use in a variety of fields. For example, see Strang
and Fix (1973, chap. 1) for applications to the solution of boundary value 
problems for ordinary differential equations, and see Pavlidis (1982, chaps. 
10-12) for applications in computer graphics. Most of the interest in piecewise 
polynomial functions has centered on spline functions. The beginning of the 
theory of spline functions is generally credited to Schoenberg in his 1946 papers, 
and he has been prominent in helping to develop the subject [e.g., see Schoenberg 
(1973)]. There is now an extensive literature on spline functions, involving many
individuals and groups. For general surveys, see Ahlberg et al. (1967), de Boor 
(1978), and Schumaker (1981). Some of the most widely used computer software 
for using spline functions is based on the programs in de Boor (1978). Versions of 
these are available in the IMSL and NAG numerical analysis libraries. 

Finite Fourier transforms, trigonometric interpolation, and associated topics 
are quite old topics; for example, see Goldstine (1977, p. 238) for a discussion of 
Gauss’ work on trigonometric interpolation. Since Fourier series and Fourier
transforms are important tools in much of applied mathematics, it is not 
surprising that there is a great deal of interest in their discrete approximations. 
Following the famous paper by Cooley and Tukey (1965) on the fast Fourier 
transform, there has been a large increase in the use of finite Fourier transforms 
and associated topics. For example, this has led to very fast methods for solving 
Laplace’s partial differential equation on rectangular regions, which we discuss 
further in Chapter 8. For a classical account of trigonometric interpolation, see 
Zygmund (1959, chap. 10), and for a more recent survey of the entire area of 
finite Fourier analysis, see Henrici (1986, chap. 13). 

Multivariate polynomial interpolation theory is a rapidly developing area,
which for reasons of space has been omitted in this text. The finite element 
method for solving partial differential equations makes extensive use of multi- 
variate interpolation theory, and some of the better presentations of this theory 
are contained in books on the finite element method. For example, see Jain 
(1984), Lapidus and Pinder (1982), Mitchell and Wait (1977), and Strang and Fix 
(1973). More recently, work in computer graphics has led to new developments 
[see Barnhill (1977) and Pavlidis (1982, chap. 13)].
Problems


1. Recall the Vandermonde matrix X given in (3.1.3), and define 
(a) Show that $V_n(x)$ is a polynomial of degree n, and that its roots are
$x_0, \ldots, x_{n-1}$. Obtain the formula

$$V_n(x) = (x - x_0) \cdots (x - x_{n-1}) V_{n-1}(x_{n-1})$$

Hint: Expand the last row of $V_n(x)$ by minors to show that $V_n(x)$ is a
polynomial of degree n and to find the coefficient of the term $x^n$.

(b) Show

$$\det(X) = V_n(x_n) = \prod_{0 \le j < i \le n} (x_i - x_j)$$


2. For the basis functions $l_{j,n}(x)$ given in (3.1.5), prove that for any $n \ge 1$,

$$\sum_{j=0}^{n} l_{j,n}(x) = 1 \qquad \text{for all } x$$


3. Recall the Lagrange functions $l_0(x), \ldots, l_n(x)$, defined in (3.1.5) and then
rewritten in a slightly different form in (3.2.4), using

$$\Psi_n(x) = (x - x_0) \cdots (x - x_n)$$

Let $w_j = [\Psi_n'(x_j)]^{-1}$. Show that the polynomial $p_n(x)$ interpolating f(x)
can be written as

$$p_n(x) = \frac{\sum_{j=0}^{n} w_j f(x_j)/(x - x_j)}{\sum_{j=0}^{n} w_j/(x - x_j)}$$

provided x is not a node point. This is called the barycentric representation
of $p_n(x)$, giving it as a weighted sum of the values $\{f(x_0), \ldots, f(x_n)\}$. For
a discussion of the use of this representation, see Henrici (1982, p. 237).


4. Consider linear interpolation in a table of values of $e^x$, $0 \le x \le 2$, with
h = .01. Let the table values be given to five significant digits, as in the
CRC tables. Bound the error of linear interpolation, including that part due
to the rounding errors in the table entries.


5. Consider linear interpolation in a table of cos(x) with x given in degrees,
$0 \le x \le 90^\circ$, with a stepsize $h = 1' = \frac{1}{60}$ degree. Assuming that the table
entries are given to five significant digits, bound the total error of
interpolation.


6. Suppose you are to make a table of values of sin(x), $0 \le x \le \pi/2$, with a
stepsize of h. Assume linear interpolation is to be used with the table, and 
suppose the total error, including the effects due to rounding in table 
entries, is to be at most 10~°. What should A equal (choose it in a 


convenient size for actual use) and to how many significant digits should
the table entries be given? 


7. Repeat Problem 6, but using $e^x$ on $0 \le x \le 1$.


8. Generalize to quadratic interpolation the material on the effect of rounding
errors in table entries, given in Section 3.1. Let $\epsilon_i = f(x_i) - f_i$, i = 0, 1, 2,
and $\epsilon \ge \operatorname{Max}\{|\epsilon_0|, |\epsilon_1|, |\epsilon_2|\}$. Show that the effect of these rounding errors
on the quadratic interpolation error is bounded by $1.25\epsilon$, assuming $x_0 \le x \le x_2$
and $x_1 - x_0 = x_2 - x_1 = h$.


9. Repeat Problem 6, but use quadratic interpolation and the result of
Problem 8. 


10. Consider producing a table of values for $f(x) = \log_{10} x$, $1 \le x \le 10$, and
assume quadratic interpolation is to be used. Let the total interpolation 
error, including the effect of rounding in table entries, be less than 107°. 
Choose an appropriate grid spacing h and the number of digits to which
the entries should be given. Would it be desirable to vary the spacing as x 
varies in [1, 10]? If so, suggest a suitable partition of [1, 10] with correspond- 
ing values of h. Use the result of Problem 8 on the effect of rounding error. 


11. Let $x_0, \ldots, x_n$ be distinct real points, and consider the following
interpolation problem. Choose a function

$$P_n(x) = \sum_{j=0}^{n} c_j e^{jx}$$

such that

$$P_n(x_i) = y_i \qquad i = 0, 1, \ldots, n$$

with the $\{y_i\}$ given data. Show there is a unique choice of $c_0, \ldots, c_n$. Hint:
The problem can be reduced to that of ordinary polynomial interpolation.


12. Consider finding a rational function p(x) = (a + bx)/(1 + cx) that
satisfies

$$p(x_i) = y_i \qquad i = 1, 2, 3$$

with $x_1, x_2, x_3$ distinct. Does such a function p(x) exist, or are additional
conditions needed to ensure existence and uniqueness of p(x)? For a
general theory of rational interpolation, see Stoer and Bulirsch (1980,
p. 58).


13. (a) Prove the recursive form of the interpolation formula, as given in
(3.2.8).

(b) Prove the recursion formula (3.2.7) for divided differences.
14. Prove the relations (3.2.18), pertaining to the variable divided difference of
a polynomial. 


15. Prove that $\mathrm{Vol}(\tau_n) = 1/n!$, where $\tau_n$ is the simplex in $R^n$ defined in (3.2.14)
of Theorem 3.3 in Section 3.2. Hint: Use (3.2.13) and other results on 
divided differences, along with a special choice for f(x). 


16. Let $p_2(x)$ be the quadratic polynomial interpolating f(x) at the evenly
spaced points $x_0$, $x_1 = x_0 + h$, $x_2 = x_0 + 2h$. Derive formulas for the
errors $f'(x_i) - p_2'(x_i)$, i = 0, 1, 2. Assuming f(x) is three times continuously
differentiable, give computable bounds for these errors. Hint: Use the
error formula (3.2.11).


17. Produce computer subroutine implementations of the algorithms Divdif
and Interp given in Section 3.2, and then write a main driver program to
use them in doing table interpolation. Choose a table from Abramowitz and
Stegun (1964) to test the program, considering several successive degrees
n for the interpolation polynomial.


18. Do an inverse interpolation problem using the table for $J_0(x)$ given in
Section 3.2. Find the value of x for which $J_0(x) = 0$, that is, calculate an
accurate estimate of the root. Estimate your accuracy, and compare this
with the actual value x = 2.4048255577.


19. Derive the analogue of Lemma 1, given in Section 3.3, for backward
differences. Use this and Newton’s divided form of the interpolating 
polynomial (3.2.9) to derive the backward difference interpolation formula 
(3.3.11). 


20. Consider the following table of values for

$$j_0(x) = \frac{\sin(x)}{x}$$

taken from Abramowitz and Stegun (1964, chap. 10).


Based on the rounding errors in the table entries, what should be the
maximum degree of polynomial interpolation used with the table? Hint: 
Use the forward difference table to detect the influence of the rounding 
errors. 


21. The following data are taken from a polynomial of degree $\le 5$. What is the
degree of the polynomial?


x    | -2  -1   0   1   2   3
p(x) | -5   1   1   1   7  25


22. The following data have noise in them that is large relative to rounding
error. Find the noise and change the data appropriately. Only the function 
values are given, since the node points are unnecessary for computing the 
forward difference table. 


304319 419327 545811 683100 
326313 443655 572433 711709 
348812 468529 599475 740756 
371806 493852 626909 770188 
395285 519615 654790 800000 


23. For $f(x) = 1/(1 + x^2)$, $-5 \le x \le 5$, produce $p_n(x)$ using n + 1 evenly
spaced nodes on [-5, 5]. Calculate $p_n(x)$ at a large number of points, and
graph it or its error on [-5, 5], as in Figure 3.6.


24. Consider the function $e^x$ on [0, b] and its approximation by an interpolating
polynomial. For $n \ge 1$, let h = b/n, $x_j = jh$, j = 0, 1, ..., n, and let
$p_n(x)$ be the nth-degree polynomial interpolating $e^x$ on the nodes
$x_0, \ldots, x_n$. Prove that

$$\underset{0 \le x \le b}{\operatorname{Max}} |e^x - p_n(x)| \to 0 \qquad \text{as } n \to \infty$$

Hint: Show $|\Psi_n(x)| \le n!\,h^{n+1}$, $0 \le x \le b$; look separately at each
subinterval $[x_{j-1}, x_j]$.


25. Prove that the general Hermite problem (3.6.16) has a unique solution p(x)
among all polynomials of degree $\le N - 1$. Hint: Show that the homogeneous
problem for the associated linear system has only the zero solution.

26. Consider the Hermite problem

$$p^{(j)}(x_i) = y_i^{(j)} \qquad i = 1, 2 \qquad j = 0, 1, 2$$

with p(x) a polynomial of degree $\le 5$.


(a) Give a Lagrange type of formula for p(x), generalizing (3.6.12) for
cubic Hermite interpolation. Hint: For the basis functions satisfying
$l(x_2) = l'(x_2) = l''(x_2) = 0$, use $l(x) = (x - x_2)^3 g(x)$, with g(x) of
degree $\le 2$. Find g(x).


(b) Give a Newton divided difference formula, generalizing (3.6.13).

(c) Derive an error formula generalizing (3.6.14).


27. Let p(x) be a polynomial solving the Hermite interpolation problem

$$p^{(j)}(a) = f^{(j)}(a) \qquad p^{(j)}(b) = f^{(j)}(b) \qquad j = 0, 1, \ldots, n-1$$

Its existence is guaranteed by the argument in Problem 25. Assuming f(x)
has 2n continuous derivatives on [a, b], show that for $a \le x \le b$,

$$f(x) - p(x) = \frac{(x - a)^n (x - b)^n}{(2n)!} f^{(2n)}(\xi_x)$$

with $a \le \xi_x \le b$. Hint: Generalize the argument used in Theorem 3.2.
28. (a) Find a polynomial p(x) of degree $\le 2$ that satisfies

$$p(x_0) = y_0 \qquad p'(x_0) = y_0' \qquad p(x_1) = y_1$$

Give a formula in the form

$$p(x) = y_0 l_0(x) + y_0' l_1(x) + y_1 l_2(x)$$

(b) Find a formula for the following polynomial interpolation problem.
Let $x_i = x_0 + ih$, i = 0, 1, 2. Find a polynomial p(x) of degree $\le 4$
for which 

p(x;) =; Roem ea e s 
P(x) =% P(m)=H% 
with the y values given. 
29. Consider the problem of finding a quadratic polynomial p(x) for which

$$p(x_0) = y_0 \qquad p'(x_1) = y_1' \qquad p(x_2) = y_2$$

with $x_0 \ne x_2$ and $\{y_0, y_1', y_2\}$ the given data. Assuming that the nodes
$x_0, x_1, x_2$ are real, what conditions must be satisfied for such a p(x) to

exist and be unique? This problem, Problem 28(a), and the following 


problem are examples of Hermite—Birkhoff interpolation problems [see 
Lorentz et al. (1983)]. 


30. (a) Show there is a unique cubic polynomial p(x) for which

$$p(x_0) = f(x_0) \qquad p(x_2) = f(x_2)$$

$$p'(x_1) = f'(x_1) \qquad p''(x_1) = f''(x_1)$$

where f(x) is a given function and $x_0 \ne x_2$. Derive a formula for
p(x).

(b) Let $x_0 = -1$, $x_1 = 0$, $x_2 = 1$. Assuming f(x) is four times continuously
differentiable on [-1, 1], show that for $-1 \le x \le 1$,

$$f(x) - p(x) = \frac{x^4 - 1}{4!} f^{(4)}(\xi_x)$$

for some $\xi_x \in [-1, 1]$. Hint: Mimic the proof of Theorem 3.2.


31. For the function f(x) = sin(x), $0 \le x \le \pi/2$, find the piecewise cubic
Hermite function $Q_n(x)$ and the piecewise cubic Lagrange function $L_m(x)$,
for n = 3, 6, 12, $m = \frac{2}{3}n$. Evaluate the maximum errors $f(x) - L_m(x)$,
$f'(x) - L_m'(x)$, $f(x) - Q_n(x)$, and $f'(x) - Q_n'(x)$. This can be done with
reasonable accuracy by evaluating the errors at 8n evenly spaced points on
$[0, \pi/2]$.


32. (a) Let $p_3(x)$ denote the cubic polynomial interpolating f(x) at the
evenly spaced points $x_j = x_0 + jh$, j = 0, 1, 2, 3. Assuming f(x) is
sufficiently differentiable, bound the error in using $p_3'(x)$ as an
approximation to $f'(x)$, $x_0 \le x \le x_3$.


(b) Let $H_2(x)$ denote the cubic Hermite polynomial interpolating f(x)
and $f'(x)$ at $x_0$ and $x_1 = x_0 + h$. Bound the error $f'(x) - H_2'(x)$ for
$x_0 \le x \le x_1$.


(c) Consider the piecewise polynomial functions $L_m(x)$ and $Q_n(x)$ of
Section 3.7, m = 2n/3, and bound the errors $f'(x) - L_m'(x)$ and
f(x) — Q/(x). Apply it to the specific case of f(x) = x* and com- 
pare your answers with the numerical results given in Table 3.14. 


33. Let s(x) be a spline function of order m. Let b be a knot, and let s(x) be a
polynomial of degree $\le m - 1$ on [a, b] and [b, c]. Show that if $s^{(m-1)}(x)$
is continuous at x = b, then s(x) is a polynomial of degree $\le m - 1$ for
$a \le x \le c$.


34. Derive the conditions on the cubic interpolating spline s(x) that are
implied by the “not-a-knot” endpoint conditions. Refer to the discussion of 
case 2, following (3.7.27). 
35. Consider finding a cubic spline interpolating function for the data


xi 0 1 2 25 3 4 
y| 14 06 10 65 6 1.0 


Use the “not-a-knot” condition to obtain boundary conditions supplement- 
ing (3.7.19). Graph the resulting function s(x). Compare it to the use of 
piecewise linear interpolation, connecting the successive points (x,, y;) by 
line segments. 


36. Write a program to investigate the rate of convergence of cubic spline
interpolation (as in Table 3.13) with various boundary conditions. Many
computer centers will have a package to produce such an interpolant, with
the user allowed to specify the boundary conditions. Otherwise, use a linear
systems solver with (3.7.19) and the additional boundary conditions.
Investigate the following boundary conditions: (a) derivatives given as in
(3.7.20); (b) the “not-a-knot” condition, given in Problem 34; and (c) the
natural spline conditions, $M_0 = M_n = 0$, of Problem 38. Apply this program
to studying the convergence of s(x) to f(x) in the following cases:
(1) $f(x) = e^x$ on [0, 1], (2) f(x) = sin(x) on $[0, \pi/2]$, and (3) $f(x) = x\sqrt{x}$
on [0, 1]. Note and compare the behavior of the error near the endpoints.


37. (a) Let q(x) be a cubic spline with a single knot x = a. In addition,
suppose that $q(x) \equiv 0$ for $x \le a$. Show that $q(x) = c(x - a)_+^3$ for
some c. Hint: For $x \ge a$, write the cubic polynomial s(x) as

$$s(x) = c_0 + c_1(x - a) + c_2(x - a)^2 + c_3(x - a)^3$$

Then apply the assumptions about s(x).


(b) Using part (a), prove the representation (3.7.31) for cubic spline
functions (m = 4).


38. Consider calculating a cubic interpolating spline based on (3.7.19) with the
additional boundary conditions

$$s''(x_0) = M_0 = 0 \qquad s''(x_n) = M_n = 0$$

This has a unique solution $\hat{s}(x)$. Show that

$$\int_{x_0}^{x_n} [\hat{s}''(x)]^2\,dx \le \int_{x_0}^{x_n} [g''(x)]^2\,dx$$

where g(x) is any twice continuously differentiable function that satisfies
the interpolating conditions $g(x_i) = y_i$, i = 0, 1, ..., n. Hint: Show (3.7.27)
is valid for $\hat{s}(x)$. [This interpolating spline $\hat{s}(x)$ is called the natural cubic
interpolating spline. It is a smooth interpolant to the data, but it usually
converges slowly near the endpoints. For more information on it, see
de Boor (1978, p. 55).]


39. To define a B-spline of order m, with $[x_i, x_{i+m}]$ the interval on which it is
nonzero, define

$$B_i^{(m)}(x) = (x_{i+m} - x_i) f_x[x_i, x_{i+1}, \ldots, x_{i+m}]$$

with $f_x(t) = (t - x)_+^{m-1}$. Derive a recursion relation for $B_i^{(m)}(x)$ in terms
of B-splines of order m - 1.


40. Use the definition (3.7.33) to investigate the behavior of cubic B-splines
when nodes are allowed to coincide. Show that if two of the nodes in
$\{x_i, x_{i+1}, \ldots, x_{i+4}\}$ coincide, then $B_i(x)$ has only one continuous
derivative at the coincident node. Similarly, if three of them coincide, show that
$B_i(x)$ is continuous at that point, but is not differentiable.


41. Let $0 \le t_0 < t_1 < \cdots < t_{2n} < 2\pi$, and consider the trigonometric
polynomial interpolation problem (3.8.5). Define

$$l_j(t) = \prod_{\substack{k=0 \\ k \ne j}}^{2n} \frac{\sin \frac{1}{2}(t - t_k)}{\sin \frac{1}{2}(t_j - t_k)}$$

for j = 0, 1, ..., 2n. Easily, $l_j(t_i) = \delta_{ij}$, $0 \le i, j \le 2n$. Show that $l_j(t)$ is a
trigonometric polynomial of degree $\le n$. Then the solution to (3.8.5) is
given by

$$p_n(t) = \sum_{j=0}^{2n} f(t_j) l_j(t)$$

Hint: Use induction on n, along with standard trigonometric identities.


42. (a) Prove the following formulas:

$$\sum_{j=1}^{m-1} \sin\left(\frac{2\pi jk}{m}\right) = 0 \qquad m \ge 2, \text{ all integers } k$$

$$\sum_{j=0}^{m-1} \cos\left(\frac{2\pi jk}{m}\right) = \begin{cases} m & k \text{ a multiple of } m \\ 0 & k \text{ not a multiple of } m \end{cases}$$

(b) Use these formulas to derive formulas for the following:

$$\sum_{j=0}^{m-1} \cos\left(\frac{2\pi jk}{m}\right) \cos\left(\frac{2\pi jl}{m}\right) \qquad \sum_{j=1}^{m-1} \sin\left(\frac{2\pi jk}{m}\right) \sin\left(\frac{2\pi jl}{m}\right)$$

$$\sum_{j=0}^{m-1} \cos\left(\frac{2\pi jk}{m}\right) \sin\left(\frac{2\pi jl}{m}\right)$$

for $0 \le k, l \le m - 1$. The formulas obtained are referred to as
discrete orthogonality relations, and they are the analogues of integral
orthogonality relations for {cos(kx), sin(kx)}.


43. Calculate the finite Fourier transform of order m of the following
sequences.

(a) $x_k = 1$, $0 \le k \le m - 1$

(b) $x_k = (-1)^k$, $0 \le k \le m - 1$, m even

(c) $x_k = k$, $0 \le k \le m - 1$


FOUR 


APPROXIMATION 
OF FUNCTIONS 


To evaluate most mathematical functions, we must first produce computable 
approximations to them. Functions are defined in a variety of ways in applica- 
tions, with integrals and infinite series being the most common types of formulas 
used for the definition. Such a definition is useful in establishing the properties of 
the function, but it is generally not an efficient way to evaluate the function. In 
this chapter we examine the use of polynomials as approximations to a given 
function. Various means of producing polynomial approximations are described, 
and they are compared as to their relative accuracy. 

For evaluating a function f(x) on a computer, it is generally more efficient of 
space and time to have an analytic approximation to f(x) rather than to store a 
table and use interpolation. It is also desirable to use the lowest possible degree 
of polynomial that will give the desired accuracy in approximating f(x). The 
following sections give a number of methods for producing an approximation, 
and generally the better approximations are also the more complicated to 
produce. The amount of time and effort expended on producing an approxima- 
tion should be directly proportional to how much the approximation will be used. 
If it is only to be used a few times, a truncated Taylor series will often suffice. But 
if an approximation is to be used millions of times by many people, then much 
care should be used in producing the approximation. 

There are forms of approximating functions other than polynomials. Rational 
functions are quotients of polynomials, and they are usually a somewhat more 
efficient form of approximation. But because polynomials furnish an adequate 
and efficient form of approximation, and because the theory for rational function 
approximation is more complicated than that of polynomial approximation, we 
have chosen to consider only polynomials. The results of this chapter can also be 
used to produce piecewise polynomial approximations, somewhat analogous to 
the piecewise polynomial interpolating functions of Section 3.7 of the preceding 
chapter. 


4.1 The Weierstrass Theorem and Taylor’s Theorem 


To justify using polynomials to approximate continuous functions, we present the 
following theorem. 
Theorem 4.1 (Weierstrass) Let f(x) be continuous for $a \le x \le b$ and let
$\epsilon > 0$. Then there is a polynomial p(x) for which

$$|f(x) - p(x)| \le \epsilon \qquad a \le x \le b$$


Proof There are many proofs of this result and of generalizations of it. Since 
this is not central to our numerical analysis development, we just indicate 
a constructive proof. For other proofs, see Davis (1963, chap. 6). 
Assume [a, b] = [0, 1] for simplicity: by an appropriate change of
variables, we can always reduce to this case if necessary. Define

$$p_n(x) = \sum_{k=0}^{n} \binom{n}{k} f\!\left(\frac{k}{n}\right) x^k (1 - x)^{n-k} \qquad 0 \le x \le 1$$

Let f(x) be bounded on [0, 1]. Then

$$\lim_{n \to \infty} p_n(x) = f(x)$$

at any point x at which f is continuous. If f(x) is continuous at every x
in [0, 1], then the convergence of $p_n$ to $f$ is uniform on [0, 1], that is,

$$\max_{0 \le x \le 1} |f(x) - p_n(x)| \to 0 \quad \text{as } n \to \infty \tag{4.1.1}$$


This gives an explicit way of finding a polynomial that satisfies the 
conclusion of the theorem. The proof of these results can be found in 
Davis (1963, pp. 108-118), along with additional properties of the 
approximating polynomials p,(x), which are called the Bernstein poly- 
nomials. They mimic extremely well the qualitative behavior of the 
function f(x). For example, if f(x) is r times continuously differentiable
on [0, 1], then

$$\max_{0 \le x \le 1} |f^{(r)}(x) - p_n^{(r)}(x)| \to 0 \quad \text{as } n \to \infty$$

But such an overall approximating property has its price, and in this
case the convergence in (4.1.1) is generally very slow. For example, if
$f(x) = x^2$, then

$$\lim_{n \to \infty} n\,[p_n(x) - f(x)] = x(1 - x)$$

and thus

$$p_n(x) - x^2 \approx \frac{1}{n}\, x(1 - x)$$

for large values of n. The error does not decrease rapidly, even for
approximating such a trivial case as $f(x) = x^2$. ∎
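The slow convergence of the Bernstein polynomials is easy to observe numerically. The following is a minimal sketch, assuming Python 3.8 or later; the function name `bernstein` is illustrative and not from the text. For $f(x) = x^2$ it prints the scaled error $n[p_n(x) - f(x)]$, which should stay near $x(1 - x)$.

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate the Bernstein polynomial p_n(x) of f on [0, 1]."""
    return sum(comb(n, k) * f(k / n) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

def square(t):
    return t * t

x = 0.5
for n in (10, 100, 400):
    err = bernstein(square, n, x) - square(x)
    # The scaled error n * err should stay close to x * (1 - x).
    print(n, err, n * err)
```

Since the error behaves like $x(1 - x)/n$, gaining one more decimal digit of accuracy requires roughly ten times as many terms.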
Taylor’s theorem Taylor’s theorem was presented earlier, in Theorem 1.4 of
Section 1.1, Chapter 1. It is the first important means for the approximation of a 
function, and it is often used as a preliminary approximation for computing some 
more efficient approximation. To aid in understanding why the Taylor approxi- 
mation is not particularly efficient, consider the following example. 


Example Find the error of approximating $e^x$ using the third-degree Taylor
polynomial $p_3(x)$ on the interval [-1, 1], expanding about x = 0. Then

$$p_3(x) = 1 + x + \tfrac{1}{2}x^2 + \tfrac{1}{6}x^3$$

$$e^x - p_3(x) = R_4(x) = \tfrac{1}{24}\, x^4 e^{\xi} \tag{4.1.2}$$

with $\xi$ between x and 0.
To examine the error carefully, we bound it from above and below:

$$\tfrac{1}{24}\, x^4 \le e^x - p_3(x) \le \tfrac{e}{24}\, x^4 \qquad 0 \le x \le 1$$

$$\tfrac{e^{-1}}{24}\, x^4 \le e^x - p_3(x) \le \tfrac{1}{24}\, x^4 \qquad -1 \le x \le 0$$

The error increases for increasing |x|, and by direct calculation,

$$\max_{-1 \le x \le 1} |e^x - p_3(x)| = .0516 \tag{4.1.3}$$


The error is not distributed evenly through the interval [—1, 1] (see Figure 4.1). It 
is much smaller near the origin than near the endpoints —1 and 1. This uneven 
distribution of the error, which is typical of the Taylor remainder, means that 
there are usually much better approximating polynomials of the same degree. 
Further examples are given in the next section. 
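The maximum error in (4.1.3) can be checked by sampling the error on a fine grid. This is only a sketch: the names `taylor_exp` and `max_error` and the grid size are assumptions made for illustration.

```python
from math import exp, factorial

def taylor_exp(n, x):
    """Degree-n Taylor polynomial of e^x about x = 0."""
    return sum(x**k / factorial(k) for k in range(n + 1))

def max_error(n, samples=2001):
    """Largest sampled value of |e^x - p_n(x)| on [-1, 1]."""
    xs = [-1 + 2 * i / (samples - 1) for i in range(samples)]
    return max(abs(exp(x) - taylor_exp(n, x)) for x in xs)

# The largest error occurs at the endpoint x = 1.
print(max_error(3))
print(abs(exp(0.1) - taylor_exp(3, 0.1)))  # far smaller near the origin
```

Comparing the two printed values shows the uneven distribution of the Taylor error over the interval.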


[Figure 4.1 Error curve for $p_3(x) \approx e^x$ on [-1, 1]; the vertical axis runs to about 0.052.]

The function space C[a, b] The set C[a, b] of all continuous real valued
functions on the interval [a, b] was introduced earlier in Section 1.1 of Chapter 1.
With it we will generally use the norm

$$\|f\|_\infty = \max_{a \le x \le b} |f(x)| \qquad f \in C[a, b] \tag{4.1.4}$$


It is called variously the maximum norm, Chebyshev norm, infinity norm, and 
uniform norm. It is the natural measure to use in approximation theory, since we 
will want to determine and compare 


$$\|f - p\|_\infty = \max_{a \le x \le b} |f(x) - p(x)| \tag{4.1.5}$$


for various polynomials p(x). Another norm for C[a, b] is introduced in Section 
4.3, one that is also useful in measuring the size of f(x) — p(x). 

As noted earlier in Section 1.1, the maximum norm satisfies the following 
characteristic properties of a norm: 


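$$\|f\| = 0 \quad \text{if and only if} \quad f \equiv 0 \tag{4.1.6}$$

$$\|\alpha f\| = |\alpha|\, \|f\| \quad \text{for all } f \in C[a, b] \text{ and all scalars } \alpha \tag{4.1.7}$$

$$\|f + g\| \le \|f\| + \|g\| \quad \text{all } f, g \in C[a, b] \tag{4.1.8}$$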


The proof of these properties for (4.1.4) is quite straightforward. And the 
properties show that the norm should be thought of as a generalization of the 
absolute value of a number. 

Although we will not make any significant use of the idea, it is often useful to 
regard C[a, b] as a vector space. The vectors are the functions f(x), a< x <b. 
We define the distance from a vector f to a vector g by 


$$D(f, g) = \|f - g\| \tag{4.1.9}$$


which is in keeping with our intuition about the concept of distance in simpler 
vector spaces. This is illustrated in Figure 4.2. 
Using the inequality (4.1.8),

$$\|f - g\| = \|(f - h) + (h - g)\| \le \|f - h\| + \|h - g\|$$

$$D(f, g) \le D(f, h) + D(h, g) \tag{4.1.10}$$


This is called the triangle inequality, because of its obvious interpretation in 
measuring the lengths of sides of a triangle whose vertices are f, g, and h. The 
name “triangle inequality” is also applied to the equivalent formulation (4.1.8). 


[Figure 4.2 Illustration for defining distance D(f, g).]

Another useful result is the reverse triangle inequality

$$\bigl|\, \|f\| - \|g\| \,\bigr| \le \|f - g\| \tag{4.1.11}$$

To prove it, use (4.1.8) to give

$$\|f\| \le \|f - g\| + \|g\|$$

$$\|f\| - \|g\| \le \|f - g\|$$

By the symmetric argument,

$$\|g\| - \|f\| \le \|g - f\| = \|f - g\|$$

with the last equality following from (4.1.7) with $\alpha = -1$. Combining these two
inequalities proves (4.1.11).

A more complete introduction to vector spaces (although only finite dimen- 
sional) and to vector norms is given in Chapter 7. And some additional geometry 
for C[a, b] is given in Sections 4.3 and 4.4 with the introduction of another 
norm, different from (4.1.4). For cases where we want to talk about functions that
have several continuous derivatives, we introduce the function space $C^r[a, b]$,
consisting of functions f(x) that have r continuous derivatives on [a, b]. This
function space is of independent interest, but we regard it as just a simplifying
notational device.


4.2 The Minimax Approximation Problem 


Let f(x) be continuous on [a, b]. To compare polynomial approximations p(x)
to f(x), obtained by various methods, it is natural to ask what is the best possible
accuracy that can be attained by using polynomials of each degree $n \ge 0$. Thus
we are led to introduce the minimax error

$$\rho_n(f) = \operatorname*{Infimum}_{\deg(q) \le n} \|f - q\|_\infty \tag{4.2.1}$$

There does not exist a polynomial q(x) of degree $\le n$ that can approximate f(x)
with a smaller maximum error than $\rho_n(f)$.

Having introduced $\rho_n(f)$, we seek whether there is a polynomial $q_n^*(x)$ for
which

$$\rho_n(f) = \|f - q_n^*\|_\infty \tag{4.2.2}$$


And if so, is it unique, what are its characteristics, and how can it be constructed? 
The approximation q*(x) is called the minimax approximation to f(x) on [a, b], 
and its theory is developed in Section 4.6. 


Example Compute the minimax polynomial approximation $q_1^*(x)$ to $e^x$ on
$-1 \le x \le 1$. Let $q_1^*(x) = a_0 + a_1 x$. To find $a_0$ and $a_1$, we have to use some
geometric insight. Consider the graph of $y = e^x$ with that of a possible
approximation $y = q_1(x)$, as in Figure 4.3. Let

$$\epsilon(x) = e^x - [a_0 + a_1 x] \tag{4.2.3}$$

[Figure 4.3 Linear minimax approximation to $e^x$.]

Clearly, $q_1^*(x)$ and $e^x$ must be equal at two points in [-1, 1], say at
$-1 < x_1 < x_2 < 1$. Otherwise, we could improve on the approximation by moving the graph
of $y = q_1^*(x)$ appropriately. Also

$$\rho_1 = \max_{-1 \le x \le 1} |\epsilon(x)|$$

and $\epsilon(x_1) = \epsilon(x_2) = 0$. By another argument based on shifting the graph of
$y = q_1^*(x)$, we conclude that the maximum error $\rho_1$ is attained at exactly three
points:

$$\epsilon(-1) = \rho_1 \qquad \epsilon(1) = \rho_1 \qquad \epsilon(x_3) = -\rho_1 \tag{4.2.4}$$

where $x_1 < x_3 < x_2$. Since $\epsilon(x)$ has a relative minimum at $x_3$, we have $\epsilon'(x_3) = 0$.
Combining these four equations, we have

$$e^{-1} - [a_0 - a_1] = \rho_1 \qquad e - [a_0 + a_1] = \rho_1$$

$$e^{x_3} - [a_0 + a_1 x_3] = -\rho_1 \qquad e^{x_3} - a_1 = 0 \tag{4.2.5}$$

These have the solution

$$a_1 = \frac{e - e^{-1}}{2} \approx 1.1752 \qquad x_3 = \ln(a_1) \approx .1614$$

$$\rho_1 = \tfrac{1}{2} e^{-1} + \frac{x_3}{4}\,(e - e^{-1}) \approx .2788$$

$$a_0 = \rho_1 + (1 - x_3)\, a_1 \approx 1.2643$$

[Figure 4.4 Error for cubic minimax approximation to $e^x$.]

Thus

$$q_1^*(x) = 1.2643 + 1.1752x \tag{4.2.6}$$

and $\rho_1 = .2788$.
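The closed-form solution of (4.2.5) can be evaluated directly, and the equal-magnitude, alternating-sign errors of (4.2.4) checked. This is a minimal sketch; the variable names mirror the symbols above, and the sampling grid is an assumption made for illustration.

```python
from math import e, exp, log

# Solution of the four equations (4.2.5).
a1 = (e - 1 / e) / 2
x3 = log(a1)
rho1 = 0.5 / e + 0.25 * x3 * (e - 1 / e)
a0 = rho1 + (1 - x3) * a1

def eps(x):
    """Error e^x - q_1^*(x) of the linear minimax approximation."""
    return exp(x) - (a0 + a1 * x)

print(a0, a1, x3, rho1)
# The error should equal +rho1, -rho1, +rho1 at x = -1, x3, 1.
print(eps(-1.0), eps(x3), eps(1.0))

# Sampled maximum of |eps| on [-1, 1]; it should not exceed rho1.
grid_max = max(abs(eps(-1 + i / 500)) for i in range(1001))
print(grid_max)
```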
By using what is called the Remes algorithm, we can also construct $q_3^*(x)$ for
$e^x$ on [-1, 1]:

$$q_3^*(x) = .994579 + .995668x + .542973x^2 + .179533x^3 \tag{4.2.7}$$


The graph of its error is given in Figure 4.4, and in contrast to the Taylor 
approximations (see Figure 4.1), the error is evenly distributed throughout the 
interval of approximation. 

We conclude the example by giving the errors in the minimax approximation
$q_n^*(x)$ and the Taylor approximations $p_n(x)$ for $f(x) = e^x$, $-1 \le x \le 1$. The
maximum errors for various n are given in Table 4.1.

The accuracy of $q_n^*(x)$ is significantly better than that of $p_n(x)$, and the
disparity increases as n increases. It should be noted that $e^x$ is a function with a
rapidly convergent Taylor series, say in comparison to $\log x$ and $\tan^{-1} x$. With
these latter functions, the minimax would look even better in comparison to the
Taylor series.


Table 4.1 Taylor and minimax errors for $e^x$

n    $\|f - p_n\|_\infty$    $\|f - q_n^*\|_\infty$
1    7.18E-1                 2.79E-1
2    2.18E-1                 4.50E-2
3    5.16E-2                 5.53E-3
4    9.95E-3                 5.47E-4
5    1.62E-3                 4.52E-5
6    2.26E-4                 3.21E-6
7    2.79E-5                 2.00E-7
8    3.06E-6                 1.11E-8
9    3.01E-7                 5.52E-10


204 APPROXIMATION OF FUNCTIONS 


4.3 The Least Squares Approximation Problem 


Because of the difficulty in calculating the minimax approximation, we often go
to an intermediate approximation called the least squares approximation. As
notation, introduce

$$\|g\|_2 = \sqrt{\int_a^b |g(x)|^2\, dx} \qquad g \in C[a, b] \tag{4.3.1}$$

This is a function norm, satisfying the same properties (4.1.6)-(4.1.8) as the
maximum norm of (4.1.4). It is a generalization of the ordinary Euclidean norm
for $R^n$, defined in (1.1.17). We return to the proof of the triangle inequality
(4.1.8) in the next section when we generalize the preceding definition.

For a given $f \in C[a, b]$ and $n \ge 0$, define

$$M_n(f) = \operatorname*{Infimum}_{\deg(r) \le n} \|f - r\|_2 \tag{4.3.2}$$

As before, does there exist a polynomial $r_n^*$ that minimizes this expression:

$$M_n(f) = \|f - r_n^*\|_2 \tag{4.3.3}$$


Is it unique? Can we calculate it? 
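To further motivate (4.3.2), consider calculating an average error in the
approximation of f(x) by r(x). For an integer $m \ge 1$, define the nodes $x_j$ by

$$x_j = a + \left(j - \tfrac{1}{2}\right)\left(\frac{b - a}{m}\right) \qquad j = 1, 2, \ldots, m$$

These are the midpoints of m evenly spaced subintervals of [a, b]. Then an
average error of approximating f(x) by r(x) on [a, b] is

$$\begin{aligned}
E &= \lim_{m \to \infty} \left\{ \frac{1}{m} \sum_{j=1}^{m} [f(x_j) - r(x_j)]^2 \right\}^{1/2} \\
  &= \lim_{m \to \infty} \left\{ \frac{1}{b - a} \sum_{j=1}^{m} [f(x_j) - r(x_j)]^2 \left(\frac{b - a}{m}\right) \right\}^{1/2} \\
  &= \sqrt{\frac{1}{b - a} \int_a^b |f(x) - r(x)|^2\, dx}
\end{aligned}$$

$$E = \frac{\|f - r\|_2}{\sqrt{b - a}} \tag{4.3.4}$$

Thus the least squares approximation should have a small average error on [a, b].
The quantity E is called the root-mean-square error in the approximation of f(x)
by r(x).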
Example Let $f(x) = e^x$, $-1 \le x \le 1$, and let $r_1(x) = b_0 + b_1 x$. Minimize

$$\|f - r_1\|_2^2 = \int_{-1}^{1} [e^x - b_0 - b_1 x]^2\, dx \equiv F(b_0, b_1) \tag{4.3.5}$$

If we expand the integrand and break the integral apart, then $F(b_0, b_1)$ is a
quadratic polynomial in the two variables $b_0$, $b_1$,

$$F = \int_{-1}^{1} \left\{ e^{2x} + b_0^2 + b_1^2 x^2 - 2 b_0 e^x - 2 b_1 x e^x + 2 b_0 b_1 x \right\} dx$$

To find a minimum, we set

$$\frac{\partial F}{\partial b_0} = 0 \qquad \frac{\partial F}{\partial b_1} = 0$$

which is a necessary condition at a minimal point. Rather than differentiating in
the previously given integral, we merely differentiate through the integral in
(4.3.5),

$$0 = \frac{\partial F}{\partial b_0} = \int_{-1}^{1} 2\,[e^x - b_0 - b_1 x](-1)\, dx$$

$$0 = \frac{\partial F}{\partial b_1} = \int_{-1}^{1} 2\,[e^x - b_0 - b_1 x](-x)\, dx$$

Then

$$b_0 = \frac{1}{2} \int_{-1}^{1} e^x\, dx = \sinh(1) \approx 1.1752$$

$$b_1 = \frac{3}{2} \int_{-1}^{1} x e^x\, dx = 3e^{-1} \approx 1.1036$$

$$r_1^*(x) = 1.1752 + 1.1036x \tag{4.3.6}$$

By direct examination,

$$\|e^x - r_1^*\|_\infty = .44$$

This is intermediate to the approximations $q_1^*$ and $p_1(x)$ derived earlier. Usually
the least squares approximation is a fairly good uniform approximation, superior
to the Taylor series approximations.
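The linear least squares fit can also be obtained numerically by forming and solving the two normal equations. The sketch below uses a composite Simpson rule as a stand-in for the exact integrals; the helper name `simpson`, the number of subintervals, and the sampling grid are all assumptions made for illustration.

```python
from math import exp

def simpson(g, a, b, n=2000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    s = g(a) + g(b)
    s += 4 * sum(g(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
    s += 2 * sum(g(a + 2 * i * h) for i in range(1, n // 2))
    return s * h / 3

# Normal equations for r(x) = b0 + b1*x on [-1, 1]:
#   b0 * int(1) + b1 * int(x)   = int(e^x)
#   b0 * int(x) + b1 * int(x^2) = int(x * e^x)
m00 = simpson(lambda x: 1.0, -1, 1)
m01 = simpson(lambda x: x, -1, 1)
m11 = simpson(lambda x: x * x, -1, 1)
r0 = simpson(exp, -1, 1)
r1 = simpson(lambda x: x * exp(x), -1, 1)

# Solve the 2 x 2 system by Cramer's rule.
det = m00 * m11 - m01 * m01
b0 = (r0 * m11 - m01 * r1) / det
b1 = (m00 * r1 - m01 * r0) / det

# Sampled maximum error of the fit on [-1, 1].
max_err = max(abs(exp(x) - b0 - b1 * x)
              for x in (-1 + i / 500 for i in range(1001)))
print(b0, b1, max_err)
```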

As a further example, the cubic least squares approximation to $e^x$ on [-1, 1]
is given by

$$r_3^*(x) = .996294 + .997955x + .536722x^2 + .176139x^3 \tag{4.3.7}$$

and

$$\|e^x - r_3^*\|_\infty = .0112$$

[Figure 4.5 Error in cubic least squares approximation to $e^x$; the vertical axis runs to about 0.011.]

The graph of the error is given in Figure 4.5. Note that the error is not
distributed as evenly in [-1, 1] as was true of the minimax $q_3^*(x)$ in Figure 4.4.


The general least squares problem We give a generalization of the least squares 
problem (4.3.2), allowing weighted average errors in the approximation of f(x) 
by a polynomial r(x). The general theory of the existence, uniqueness, and 
construction of least squares approximations is given in Section 4.5. It uses the 
theory of orthogonal polynomials, which is given in the next section. 

Let w(x) be a nonnegative weight function on the interval (a, b), which may
be infinite, and assume the following properties:

1. $$\int_a^b |x|^n w(x)\, dx \tag{4.3.8}$$
   is integrable and finite for all $n \ge 0$;
2. Suppose that
   $$\int_a^b w(x)\, g(x)\, dx = 0 \tag{4.3.9}$$
   for some nonnegative continuous function g(x); then the function $g(x) \equiv 0$
   on (a, b).


Example The following are the weight functions of most interest in the
developments of this text:

$$w(x) = 1 \qquad a \le x \le b$$

$$w(x) = \frac{1}{\sqrt{1 - x^2}} \qquad -1 \le x \le 1$$

$$w(x) = e^{-x} \qquad 0 \le x < \infty$$
For a finite interval [a, b], the general least squares problem can now be
stated. Given $f \in C[a, b]$, does there exist a polynomial $r_n^*(x)$ of degree $\le n$ that
minimizes

$$\int_a^b w(x)\,[f(x) - r(x)]^2\, dx \tag{4.3.10}$$

among all polynomials r(x) of degree $\le n$? The function w(x) allows different
degrees of importance to be given to the error at different points in the interval
[a, b]. This will prove useful in developing near minimax approximations.
Define

$$F(a_0, a_1, \ldots, a_n) = \int_a^b w(x) \Bigl[ f(x) - \sum_{j=0}^{n} a_j x^j \Bigr]^2 dx \tag{4.3.11}$$

in order to compute (4.3.10) for an arbitrary polynomial r(x) of degree $\le n$. We
want to minimize F as the coefficients $\{a_j\}$ range over all real numbers. A
necessary condition for a point $(a_0, \ldots, a_n)$ to be a minimizing point is

$$\frac{\partial F}{\partial a_i} = 0 \qquad i = 0, 1, \ldots, n \tag{4.3.12}$$

By differentiating through the integral in (4.3.11) and using (4.3.12), we obtain
the linear system

$$\sum_{j=0}^{n} a_j \int_a^b w(x)\, x^{i+j}\, dx = \int_a^b w(x) f(x)\, x^i\, dx \qquad i = 0, 1, \ldots, n \tag{4.3.13}$$

To see why this solution of the least squares problem is unsatisfactory,
consider the special case w(x) = 1, [a, b] = [0, 1]. Then the linear system
becomes

$$\sum_{j=0}^{n} \frac{a_j}{i + j + 1} = \int_0^1 f(x)\, x^i\, dx \qquad i = 0, 1, \ldots, n \tag{4.3.14}$$

The matrix of coefficients is the Hilbert matrix of order n + 1, which was
introduced in (1.6.9). The solution of linear system (4.3.14) is extremely sensitive
to small changes in the coefficients or right-hand constants. Thus this is not a


good way to approach the least squares problem. In single precision arithmetic 
on an IBM 3033 computer, the cases n > 4 will be completely unsatisfactory. 
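The sensitivity of (4.3.14) can be seen without any rounding error by working in exact rational arithmetic. The sketch below is an illustration under stated assumptions: the helper names `hilbert` and `solve`, the choice n = 4, and the size of the perturbation are not from the text. A right-hand side whose exact solution is all ones is perturbed in a single entry, and the resulting change in the solution is measured.

```python
from fractions import Fraction

def hilbert(order):
    """Hilbert matrix with entries 1 / (i + j + 1), as exact fractions."""
    return [[Fraction(1, i + j + 1) for j in range(order)]
            for i in range(order)]

def solve(A, b):
    """Gaussian elimination in exact rational arithmetic.
    No pivoting is used; the Hilbert matrix is positive definite,
    so the pivots are nonzero."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for k in range(n):
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

order = 5                        # n = 4 in (4.3.14)
H = hilbert(order)
exact = [Fraction(1)] * order
b = [sum(row) for row in H]      # right-hand side with solution (1, ..., 1)

delta = Fraction(1, 10**6)       # small change in one right-hand constant
b_pert = b[:]
b_pert[0] += delta
x_pert = solve(H, b_pert)

# Ratio of the largest change in the solution to the change in the data.
amplification = max(abs(xp - xe) for xp, xe in zip(x_pert, exact)) / delta
print(float(amplification))
```

The amplification factor grows rapidly with the order of the matrix, which is why rounding errors in finite-precision arithmetic spoil the computed coefficients.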


4.4 Orthogonal Polynomials 


As is evident from graphs of $x^n$ on [0, 1], $n \ge 0$, these monomials are very nearly
linearly dependent, looking much alike, and this results in the instability of the
linear system (4.3.14). To avoid this problem, we consider an alternative basis for
the polynomials, based on polynomials that are orthogonal in a function space
sense that is given below. These results are of fundamental importance in the
approximation of functions and for much work in classical applied mathematics.

Let w(x) be the same as in (4.3.8) and (4.3.9), and define the inner product of
two continuous functions f and g by

$$(f, g) = \int_a^b w(x) f(x) g(x)\, dx \qquad f, g \in C[a, b] \tag{4.4.1}$$

Then the following simple properties are easily shown.

1. $(\alpha f, g) = (f, \alpha g) = \alpha (f, g)$ for all scalars $\alpha$

2. $(f_1 + f_2, g) = (f_1, g) + (f_2, g)$
   $(f, g_1 + g_2) = (f, g_1) + (f, g_2)$

3. $(f, g) = (g, f)$

4. $(f, f) \ge 0$ for all $f \in C[a, b]$, and $(f, f) = 0$ if and only if
   $f(x) \equiv 0$, $a \le x \le b$

Define the two norm or Euclidean norm by

$$\|f\|_2 = \sqrt{\int_a^b w(x)\,[f(x)]^2\, dx} = \sqrt{(f, f)} \tag{4.4.2}$$


This definition will satisfy the norm properties (4.1.6)—(4.1.8). But the proof of 
the triangle inequality (4.1.8) is no longer obvious, and it depends on the 
following well-known inequality. 


Lemma (Cauchy-Schwarz inequality) For $f, g \in C[a, b]$,

$$|(f, g)| \le \|f\|_2\, \|g\|_2 \tag{4.4.3}$$

Proof If $g \equiv 0$, the result is trivially true. For $g \not\equiv 0$, consider the following.
For any real number $\alpha$,

$$0 \le (f + \alpha g, f + \alpha g) = (f, f) + 2\alpha (f, g) + \alpha^2 (g, g)$$

The polynomial on the right has at most one real root, and thus the
discriminant cannot be positive,

$$4\,|(f, g)|^2 - 4\,(f, f)(g, g) \le 0$$

This implies (4.4.3). Note that we have equality in (4.4.3) only if the
discriminant is exactly zero. But that implies there is an $\alpha^*$ for which the
polynomial is zero. Then

$$(f + \alpha^* g, f + \alpha^* g) = 0$$

and thus $f = -\alpha^* g$. Thus equality holds in (4.4.3) if and only if either
(1) f is a multiple of g, or (2) f or g is identically zero. ∎


To prove the triangle inequality (4.1.8),

$$\begin{aligned}
\|f + g\|_2^2 &= (f + g, f + g) = (f, f) + 2(f, g) + (g, g) \\
&\le \|f\|_2^2 + 2\|f\|_2 \|g\|_2 + \|g\|_2^2 = (\|f\|_2 + \|g\|_2)^2
\end{aligned}$$

Take square roots of each side of the inequality to obtain

$$\|f + g\|_2 \le \|f\|_2 + \|g\|_2 \tag{4.4.4}$$


We are interested in obtaining a basis for the polynomials other than the
ordinary basis, the monomials $\{1, x, x^2, \ldots\}$. We produce what is called an
orthogonal basis, a generalization of orthogonal basis in the space $R^n$ (see
Section 7.1). We say that f and g are orthogonal if

$$(f, g) = 0 \tag{4.4.5}$$
The following is a constructive existence result for orthogonal polynomials. 


Theorem 4.2 (Gram-Schmidt) There exists a sequence of polynomials
$\{\varphi_n(x) \mid n \ge 0\}$ with degree$(\varphi_n) = n$, for all n, and

$$(\varphi_n, \varphi_m) = 0 \quad \text{for all } n \ne m \qquad n, m \ge 0 \tag{4.4.6}$$

In addition, we can construct the sequence with the additional
properties: (1) $(\varphi_n, \varphi_n) = 1$, for all n; (2) the coefficient of $x^n$ in
$\varphi_n(x)$ is positive. With these additional properties, the sequence
$\{\varphi_n\}$ is unique.


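Proof We show a constructive and recursive method of obtaining the members
of the sequence; it is called the Gram-Schmidt process. Let

$$\varphi_0(x) = c$$

a constant. Pick it such that $\|\varphi_0\|_2 = 1$ and $c > 0$. Then

$$(\varphi_0, \varphi_0) = c^2 \int_a^b w(x)\, dx = 1$$

$$c = \left[ \int_a^b w(x)\, dx \right]^{-1/2}$$

For constructing $\varphi_1(x)$, begin with

$$\psi_1(x) = x + a_{1,0}\, \varphi_0(x)$$

Then

$$(\psi_1, \varphi_0) = 0 \quad \text{implies} \quad 0 = (x, \varphi_0) + a_{1,0}\,(\varphi_0, \varphi_0)$$

$$a_{1,0} = -(x, \varphi_0) = -\frac{\int_a^b x\, w(x)\, dx}{\left[ \int_a^b w(x)\, dx \right]^{1/2}}$$

Define

$$\varphi_1(x) = \frac{\psi_1(x)}{\|\psi_1\|_2}$$

and note that

$$\|\varphi_1\|_2 = 1 \qquad (\varphi_1, \varphi_0) = 0$$

and the coefficient of x is positive.
To construct $\varphi_n(x)$, first define

$$\psi_n(x) = x^n + a_{n,n-1}\, \varphi_{n-1}(x) + \cdots + a_{n,0}\, \varphi_0(x) \tag{4.4.7}$$

and choose the constants to make $\psi_n$ orthogonal to $\varphi_j$ for $j = 0, \ldots, n - 1$. Then

$$(\psi_n, \varphi_j) = 0 \quad \text{implies} \quad a_{n,j} = -(x^n, \varphi_j) \qquad j = 0, 1, \ldots, n - 1 \tag{4.4.8}$$

The desired $\varphi_n(x)$ is

$$\varphi_n(x) = \frac{\psi_n(x)}{\|\psi_n\|_2} \tag{4.4.9}$$

Continue the derivation inductively.


Example For the special case of w(x) = 1, [a, b] = [-1, 1], we have

$$\varphi_0(x) = \frac{1}{\sqrt{2}} \qquad \varphi_1(x) = \sqrt{\frac{3}{2}}\, x \qquad \varphi_2(x) = \frac{1}{2} \sqrt{\frac{5}{2}}\, (3x^2 - 1)$$

and further polynomials can be constructed by the same process.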
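The construction (4.4.7)-(4.4.9) translates almost line for line into code for the case w(x) = 1 on [-1, 1], where inner products of monomials are known in closed form. In this sketch polynomials are stored as coefficient lists in ascending powers; the function names `inner` and `gram_schmidt` are illustrative assumptions.

```python
from math import sqrt

def inner(p, q):
    """Inner product on [-1, 1] with w(x) = 1.
    Uses int x^k dx = 2 / (k + 1) for even k and 0 for odd k."""
    total = 0.0
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            k = i + j
            if k % 2 == 0:
                total += pi * qj * 2.0 / (k + 1)
    return total

def gram_schmidt(count):
    """Orthonormal polynomials phi_0, ..., phi_{count-1}."""
    basis = []
    for n in range(count):
        mono = [0.0] * n + [1.0]          # the monomial x^n
        psi = mono[:]
        for phi in basis:
            a = -inner(mono, phi)         # a_{n,j} = -(x^n, phi_j), as in (4.4.8)
            for i, c in enumerate(phi):
                psi[i] += a * c
        norm = sqrt(inner(psi, psi))
        basis.append([c / norm for c in psi])   # normalization (4.4.9)
    return basis

phis = gram_schmidt(3)
for phi in phis:
    print(phi)
```

The printed coefficient lists can be compared with the three polynomials of the example. For large degrees this classical form of the process loses orthogonality in floating-point arithmetic, so the sketch is intended only for small `count`.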


There is a very large literature on these polynomials, including a variety of 
formulas for them. Of necessity we only skim the surface. The polynomials are 
usually given in a form for which $\|\varphi_n\|_2 \ne 1$.


Particular Cases 1. The Legendre polynomials. Let w(x) = 1 on [-1, 1].
Define

$$P_n(x) = \frac{(-1)^n}{2^n\, n!}\, \frac{d^n}{dx^n} \left[ (1 - x^2)^n \right] \qquad n \ge 1 \tag{4.4.10}$$

with $P_0(x) = 1$. These are orthogonal on [-1, 1], degree $P_n(x) = n$, and
$P_n(1) = 1$ for all n. Also,

$$(P_n, P_n) = \frac{2}{2n + 1} \qquad \varphi_n(x) = \sqrt{\frac{2n + 1}{2}}\, P_n(x) \tag{4.4.11}$$

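2. The Chebyshev polynomials. Let $w(x) = 1/\sqrt{1 - x^2}$, $-1 \le x \le 1$. Then

$$T_n(x) = \cos(n \cos^{-1} x) \qquad n \ge 0 \tag{4.4.12}$$

is an orthogonal family of polynomials with deg$(T_n) = n$. To see that $T_n(x)$ is a
polynomial, let $\cos^{-1} x = \theta$, $0 \le \theta \le \pi$. Then

$$T_{n \pm 1}(x) = \cos(n \pm 1)\theta = \cos(n\theta) \cos\theta \mp \sin(n\theta) \sin\theta$$

$$T_{n+1}(x) + T_{n-1}(x) = 2 \cos(n\theta) \cos\theta = 2\, T_n(x)\, x$$

$$T_{n+1}(x) = 2x\, T_n(x) - T_{n-1}(x) \qquad n \ge 1 \tag{4.4.13}$$

Also by direct calculation in (4.4.12),

$$T_0(x) = 1 \qquad T_1(x) = x$$

Using (4.4.13),

$$T_2(x) = 2x^2 - 1 \qquad T_3(x) = 2x(2x^2 - 1) - x = 4x^3 - 3x$$

The polynomials also satisfy $T_n(1) = 1$, $n \ge 0$, and

$$(T_n, T_m) = \begin{cases} 0 & n \ne m \\ \pi & m = n = 0 \\ \pi/2 & n = m > 0 \end{cases} \tag{4.4.14}$$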


The Chebyshev polynomials are extremely important in approximation theory, 
and they also arise in many other areas of applied mathematics. For a more 
complete discussion of them, see Rivlin (1974), and Fox and Parker (1968). We 
give further properties of the Chebyshev polynomials in the following sections. 
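The recursion (4.4.13) gives a simple way to evaluate $T_n(x)$ without trigonometric functions. The following is a minimal sketch (the function name `chebyshev_t` is an illustrative assumption) that also compares the result with the defining formula (4.4.12).

```python
from math import acos, cos

def chebyshev_t(n, x):
    """Evaluate T_n(x) with the triple recursion (4.4.13)."""
    t_prev, t_curr = 1.0, x            # T_0(x) and T_1(x)
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

x = 0.3
for n in range(6):
    # The two columns should agree to rounding error.
    print(n, chebyshev_t(n, x), cos(n * acos(x)))
```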


3. The Laguerre polynomials. Let $w(x) = e^{-x}$, $[a, b] = [0, \infty)$. Then

$$L_n(x) = \frac{1}{n!\, e^{-x}}\, \frac{d^n}{dx^n} \left( x^n e^{-x} \right) \qquad n \ge 0 \tag{4.4.15}$$

$\|L_n\|_2 = 1$ for all n, and $\{L_n\}$ are orthogonal on $[0, \infty)$ relative to the weight
function $e^{-x}$.


We say that a family of functions is an orthogonal family if each member is
orthogonal to every other member of the family. We call it an orthonormal family
if it is an orthogonal family and if every member has length one, that is,
$\|f\|_2 = 1$. For other examples of orthogonal polynomials, see Abramowitz and
Stegun (1964, chap. 22), Davis (1963, app.), Szego (1968).


Some properties of orthogonal polynomials These results will be useful later in 
this chapter and in the next chapter. 


Theorem 4.3 Let $\{\varphi_n(x) \mid n \ge 0\}$ be an orthogonal family of polynomials on
(a, b) with weight function w(x). With such a family we always
assume implicitly that degree $\varphi_n = n$, $n \ge 0$. If f(x) is a polynomial
of degree m, then

$$f(x) = \sum_{n=0}^{m} \frac{(f, \varphi_n)}{(\varphi_n, \varphi_n)}\, \varphi_n(x) \tag{4.4.16}$$

Proof We begin by showing that every polynomial can be written as a combination
of orthogonal polynomials of no greater degree. Since degree$(\varphi_0) = 0$,
we have $\varphi_0(x) = c$, a constant, and thus

$$1 = \frac{1}{c}\, \varphi_0(x)$$

Since degree$(\varphi_1) = 1$, we have from the construction in the
Gram-Schmidt process,

$$\varphi_1(x) = c_{1,1}\, x + c_{1,0}\, \varphi_0(x) \qquad c_{1,1} \ne 0$$

$$x = \frac{1}{c_{1,1}} \left[ \varphi_1(x) - c_{1,0}\, \varphi_0(x) \right]$$

By induction in the Gram-Schmidt process,

$$\varphi_r(x) = c_{r,r}\, x^r + c_{r,r-1}\, \varphi_{r-1}(x) + \cdots + c_{r,0}\, \varphi_0(x) \qquad c_{r,r} \ne 0$$

and

$$x^r = \frac{1}{c_{r,r}} \left[ \varphi_r(x) - c_{r,r-1}\, \varphi_{r-1}(x) - \cdots - c_{r,0}\, \varphi_0(x) \right]$$

Thus every monomial can be rewritten as a combination of orthogonal
polynomials of no greater degree. From this it follows easily that an
arbitrary polynomial f(x) of degree m can be written as

$$f(x) = b_m \varphi_m(x) + \cdots + b_0 \varphi_0(x)$$

for some choice of $b_0, \ldots, b_m$. To calculate each $b_i$, multiply both sides
by w(x) and $\varphi_i(x)$, and integrate over (a, b). Then

$$(f, \varphi_i) = \sum_{j=0}^{m} b_j\, (\varphi_j, \varphi_i) = b_i\, (\varphi_i, \varphi_i)$$

$$b_i = \frac{(f, \varphi_i)}{(\varphi_i, \varphi_i)}$$

which proves (4.4.16) and the theorem. ∎


Corollary If f(x) is a polynomial of degree $\le m - 1$, then

$$(f, \varphi_m) = 0 \tag{4.4.17}$$

and $\varphi_m(x)$ is orthogonal to f(x).


Proof It follows easily from (4.4.16) and the orthogonality of the family
$\{\varphi_n(x)\}$. ∎


The following result gives some intuition as to the shape of the graphs of the 
orthogonal polynomials. It is also crucial to the work on Gaussian quadrature in 
Chapter 5. 


Theorem 4.4 Let $\{\varphi_n(x) \mid n \ge 0\}$ be an orthogonal family of polynomials on
(a, b) with weight function $w(x) \ge 0$. Then the polynomial $\varphi_n(x)$
has exactly n distinct real roots in the open interval (a, b).

Proof Let $x_1, x_2, \ldots, x_m$ be all of the zeros of $\varphi_n(x)$ for which

1. $a < x_i < b$
2. $\varphi_n(x)$ changes sign at $x_i$

Since degree$(\varphi_n) = n$, we trivially have $m \le n$. We assume $m < n$ and
then derive a contradiction.
Define

$$B(x) = (x - x_1) \cdots (x - x_m)$$

By the definition of the points $x_1, \ldots, x_m$, the polynomial

$$\varphi_n(x) B(x) = (x - x_1) \cdots (x - x_m)\, \varphi_n(x)$$

does not change sign in (a, b). To see this more clearly, the assumptions
on $x_1, \ldots, x_m$ imply

$$\varphi_n(x) = h(x)\, (x - x_1)^{r_1} \cdots (x - x_m)^{r_m}$$

with each $r_i$ odd and with h(x) not changing sign in (a, b). Then

$$\varphi_n(x) B(x) = (x - x_1)^{r_1 + 1} \cdots (x - x_m)^{r_m + 1}\, h(x)$$

and the conclusion follows. Consequently,

$$\int_a^b w(x) B(x) \varphi_n(x)\, dx \ne 0$$

since clearly $B(x) \varphi_n(x) \not\equiv 0$. But since degree$(B) = m < n$, the corollary
to Theorem 4.3 implies

$$\int_a^b w(x) B(x) \varphi_n(x)\, dx = (B, \varphi_n) = 0$$

This is a contradiction, and thus we must have m = n. But then the
conclusion of the theorem will follow, since $\varphi_n(x)$ can have at most n
roots, and the assumptions on $x_1, \ldots, x_n$ imply that they must all be
simple, that is, $\varphi_n'(x_i) \ne 0$. ∎


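As previously, let $\{\varphi_n(x) \mid n \ge 0\}$ be an orthogonal family on (a, b) with
weight function $w(x) \ge 0$. Define $A_n$ and $B_n$ by

$$\varphi_n(x) = A_n x^n + B_n x^{n-1} + \cdots \tag{4.4.18}$$

Also, write

$$\varphi_n(x) = A_n (x - x_{n,1})(x - x_{n,2}) \cdots (x - x_{n,n}) \tag{4.4.19}$$

Let

$$a_n = \frac{A_{n+1}}{A_n} \qquad \gamma_n = (\varphi_n, \varphi_n) > 0 \tag{4.4.20}$$


Theorem 4.5 (Triple Recursion Relation) Let $\{\varphi_n\}$ be an orthogonal family of
polynomials on (a, b) with weight function $w(x) \ge 0$. Then for $n \ge 1$,

$$\varphi_{n+1}(x) = (a_n x + b_n)\, \varphi_n(x) - c_n\, \varphi_{n-1}(x) \tag{4.4.21}$$

with

$$b_n = a_n \left[ \frac{B_{n+1}}{A_{n+1}} - \frac{B_n}{A_n} \right] \qquad c_n = \frac{A_{n+1} A_{n-1}}{A_n^2} \cdot \frac{\gamma_n}{\gamma_{n-1}} \tag{4.4.22}$$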

Proof First note that the triple recursion relation (4.4.13) for Chebyshev
polynomials is an example of (4.4.21). To derive (4.4.21), we begin by
considering the polynomial

$$\begin{aligned}
G(x) &= \varphi_{n+1}(x) - a_n x\, \varphi_n(x) \\
&= \left[ A_{n+1} x^{n+1} + B_{n+1} x^n + \cdots \right] - \frac{A_{n+1}}{A_n}\, x \left[ A_n x^n + B_n x^{n-1} + \cdots \right] \\
&= \left[ B_{n+1} - \frac{A_{n+1} B_n}{A_n} \right] x^n + \cdots
\end{aligned}$$

and degree$(G) \le n$. By Theorem 4.3, we can write

$$G(x) = d_n \varphi_n(x) + \cdots + d_0 \varphi_0(x)$$

for an appropriate set of constants $d_0, \ldots, d_n$. Calculating $d_i$,

$$d_i = \frac{(G, \varphi_i)}{\gamma_i} = \frac{1}{\gamma_i} \left[ (\varphi_{n+1}, \varphi_i) - a_n (x \varphi_n, \varphi_i) \right] \tag{4.4.23}$$

We have $(\varphi_{n+1}, \varphi_i) = 0$ for $i \le n$, and

$$(x \varphi_n, \varphi_i) = \int_a^b w(x)\, \varphi_n(x)\, x \varphi_i(x)\, dx = 0$$

for $i \le n - 2$, since then degree$(x \varphi_i(x)) \le n - 1$. Combining these
results,

$$d_i = 0 \qquad 0 \le i \le n - 2$$

and therefore

$$G(x) = d_n \varphi_n(x) + d_{n-1} \varphi_{n-1}(x)$$

$$\varphi_{n+1}(x) = (a_n x + d_n)\, \varphi_n(x) + d_{n-1}\, \varphi_{n-1}(x) \tag{4.4.24}$$

This shows the existence of a triple recursion relation, and the remaining
work is manipulation of the formulas to obtain $d_n$ and $d_{n-1}$, given as $b_n$
and $-c_n$ in (4.4.22). These constants are not derived here, but their values
are important for some applications (see Problem 18). ∎


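Example 1. For Laguerre polynomials,

$$L_{n+1}(x) = \frac{1}{n + 1}\, [2n + 1 - x]\, L_n(x) - \frac{n}{n + 1}\, L_{n-1}(x) \tag{4.4.25}$$

2. For Legendre polynomials,

$$P_{n+1}(x) = \frac{2n + 1}{n + 1}\, x P_n(x) - \frac{n}{n + 1}\, P_{n-1}(x) \tag{4.4.26}$$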
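The Legendre recursion (4.4.26) can be used to generate the polynomials exactly and to confirm the properties stated in (4.4.10)-(4.4.11). The sketch below stores each polynomial as a list of rational coefficients in ascending powers; the function names `legendre` and `inner` are illustrative assumptions.

```python
from fractions import Fraction

def legendre(count):
    """Coefficient lists (ascending powers) of P_0, ..., P_{count-1},
    generated with the recursion (4.4.26)."""
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]
    for n in range(1, count - 1):
        x_pn = [Fraction(0)] + polys[n]            # x * P_n(x)
        prev = polys[n - 1] + [Fraction(0)] * 2    # P_{n-1}, padded to equal length
        polys.append([Fraction(2 * n + 1, n + 1) * c - Fraction(n, n + 1) * d
                      for c, d in zip(x_pn, prev)])
    return polys[:count]

def inner(p, q):
    """Exact integral of p(x) * q(x) over [-1, 1] (weight w(x) = 1)."""
    total = Fraction(0)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if (i + j) % 2 == 0:
                total += a * b * Fraction(2, i + j + 1)
    return total

polys = legendre(6)
for n, p in enumerate(polys):
    # P_n(1) is the sum of the coefficients; (P_n, P_n) should be 2 / (2n + 1).
    print(n, sum(p), inner(p, p))
```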
Theorem 4.6 (Christoffel-Darboux Identity) For $\{\varphi_n\}$ an orthogonal family of
polynomials with weight function $w(x) \ge 0$,

$$\sum_{k=0}^{n} \frac{\varphi_k(x)\, \varphi_k(y)}{\gamma_k} = \frac{\varphi_{n+1}(x)\, \varphi_n(y) - \varphi_n(x)\, \varphi_{n+1}(y)}{a_n \gamma_n\, (x - y)} \qquad x \ne y \tag{4.4.27}$$


Proof The proof is based on manipulating the triple recursion relation [see
Szego (1967, p. 43)].


4.5 The Least Squares Approximation Problem 
(continued) 


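We now return to the general least squares problem, of minimizing (4.3.10)
among all polynomials of degree $\le n$. Assume that $\{\varphi_k(x) \mid k \ge 0\}$ is an
orthonormal family of polynomials with weight function $w(x) \ge 0$, that is,

$$(\varphi_n, \varphi_m) = \delta_{nm} = \begin{cases} 1 & n = m \\ 0 & n \ne m \end{cases}$$

Then an arbitrary polynomial r(x) of degree $\le n$ can be written as

$$r(x) = b_0 \varphi_0(x) + \cdots + b_n \varphi_n(x) \tag{4.5.1}$$

For a given $f \in C[a, b]$,

$$\|f - r\|_2^2 = \int_a^b w(x) \Bigl[ f(x) - \sum_{j=0}^{n} b_j \varphi_j(x) \Bigr]^2 dx \equiv G(b_0, \ldots, b_n) \tag{4.5.2}$$

We solve the least squares problem by minimizing G.
As before, we could set

$$\frac{\partial G}{\partial b_i} = 0 \qquad i = 0, 1, \ldots, n$$

But to obtain a more complete result, we proceed in another way. For any choice
of $b_0, \ldots, b_n$,

$$\begin{aligned}
0 \le G(b_0, \ldots, b_n) &= \Bigl( f - \sum_{j=0}^{n} b_j \varphi_j,\; f - \sum_{i=0}^{n} b_i \varphi_i \Bigr) \\
&= (f, f) - 2 \sum_{j=0}^{n} b_j (f, \varphi_j) + \sum_{i=0}^{n} \sum_{j=0}^{n} b_i b_j (\varphi_i, \varphi_j) \\
&= \|f\|_2^2 - 2 \sum_{j=0}^{n} b_j (f, \varphi_j) + \sum_{j=0}^{n} b_j^2 \\
&= \|f\|_2^2 - \sum_{j=0}^{n} (f, \varphi_j)^2 + \sum_{j=0}^{n} \left[ (f, \varphi_j) - b_j \right]^2
\end{aligned} \tag{4.5.3}$$

which can be checked by expanding the last term. Thus G is a minimum if and
only if

$$b_j = (f, \varphi_j) \qquad j = 0, 1, \ldots, n$$

Then the least squares approximation exists, is unique, and is given by

$$r_n^*(x) = \sum_{j=0}^{n} (f, \varphi_j)\, \varphi_j(x)$$

Moreover, from (4.5.2) and (4.5.3),

$$\|f - r_n^*\|_2 = \Bigl[ \|f\|_2^2 - \sum_{j=0}^{n} (f, \varphi_j)^2 \Bigr]^{1/2} = \sqrt{\|f\|_2^2 - \|r_n^*\|_2^2} \tag{4.5.4}$$

$$\|f\|_2^2 = \|r_n^*\|_2^2 + \|f - r_n^*\|_2^2 \tag{4.5.5}$$

As a practical note in obtaining $r_{n+1}^*(x)$, use

$$r_{n+1}^*(x) = r_n^*(x) + (f, \varphi_{n+1})\, \varphi_{n+1}(x) \tag{4.5.6}$$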

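Theorem 4.7 Assuming [a, b] is finite,

$$\lim_{n \to \infty} \|f - r_n^*\|_2 = 0 \tag{4.5.7}$$

Proof By definition of $r_n^*$ as a minimizing polynomial for $\|f - r_n\|_2$, we have

$$\|f - r_0^*\|_2 \ge \|f - r_1^*\|_2 \ge \cdots \ge \|f - r_n^*\|_2 \ge \cdots \tag{4.5.8}$$

Let $\epsilon > 0$ be arbitrary. Then by the Weierstrass theorem, there is a
polynomial Q(x) of some degree, say m, for which

$$\max_{a \le x \le b} |f(x) - Q(x)| \le \frac{\epsilon}{c} \qquad c = \sqrt{\int_a^b w(x)\, dx}$$

By the definition of $r_m^*(x)$,

$$\|f - r_m^*\|_2 \le \|f - Q\|_2 = \left[ \int_a^b w(x)\,[f(x) - Q(x)]^2\, dx \right]^{1/2} \le \left[ \int_a^b w(x)\, \frac{\epsilon^2}{c^2}\, dx \right]^{1/2} = \epsilon$$

Combining this with (4.5.8),

$$\|f - r_n^*\|_2 \le \epsilon$$

for all $n \ge m$. Since $\epsilon$ was arbitrary, this proves (4.5.7). ∎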


Using (4.5.5) and a straightforward computation of $\|r_n^*\|_2$, we have Bessel’s
inequality:

$$\|r_n^*\|_2^2 = \sum_{j=0}^{n} (f, \varphi_j)^2 \le \|f\|_2^2 \tag{4.5.9}$$

and using (4.5.7) in (4.5.5), we obtain Parseval’s equality,

$$\|f\|_2 = \Bigl[ \sum_{j=0}^{\infty} (f, \varphi_j)^2 \Bigr]^{1/2} \tag{4.5.10}$$

Theorem 4.7 does not say that $\|f - r_n^*\|_\infty \to 0$. But if additional differentiability
assumptions are placed on f(x), results on the uniform convergence of $r_n^*$ to f
can be proved. An example is given later.


Legendre Polynomial Expansions To solve the least squares problem on a finite 
interval [a, b] with w(x) = 1, we can convert it to a problem on [—1,1]. The 
change of variable 


b+at+(b-a)t 


4.5.11) . 
converts the interval -1 <1<1toa<x <b. Define 
b+a+(b-a)t 
Eiy =f eee -l<r<l (4.5.12) 


for a given f € C{a, b]. Then 


[PUG ~ alo ae = (7F*) PLO = a (OP a 


with R,(t) obtained from r,(x) using (4.5.11). The change of variable (4.5.11) 
gives a one-to-one correspondence between polynomials of degree m on [a, 5] 
and of degree m on [—1,1], for every m > 0. Thus minimizing |[f— /,||, on 
[a, b] is equivalent to minimizing ||p — R,,||, on [—1,1]. We therefore restrict 
our interest to the least squares problem on [—1, 1]. 

Given $f \in C[-1, 1]$, the orthonormal family described in Theorem 4.2 is

$$\varphi_0(x) = \frac{1}{\sqrt{2}}, \qquad \varphi_n(x) = \sqrt{\frac{2n + 1}{2}}\,P_n(x), \quad n \ge 1 \tag{4.5.13}$$

The least squares approximation is

$$r_n^*(x) = \sum_{j=0}^{n} (f, \varphi_j)\,\varphi_j(x), \qquad (f, \varphi_j) = \int_{-1}^{1} f(x)\,\varphi_j(x)\,dx \tag{4.5.14}$$

the solution of the original least squares problem posed in Section 4.3. The
coefficients $(f, \varphi_j)$ are called Legendre coefficients.


Table 4.2 Legendre expansion coefficients for $e^x$

j    (f, φ_j)        j    (f, φ_j)
0    1.661985        3    .037660
1     .901117        4    .004698
2     .226302        5    .000469


Example For $f(x) = e^x$ on [-1, 1], the expansion coefficients $(f, \varphi_j)$ of (4.5.14)
are given in Table 4.2. The approximation $r_3^*(x)$ was given earlier in (4.3.7),
written in standard polynomial form. For the average error E in $r_3^*(x)$, combine
(4.3.4), (4.5.4), (4.5.9), and the table coefficients to give

$$E = \frac{1}{\sqrt{2}}\,\|e^x - r_3^*\|_2 \doteq .0034$$
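The Legendre coefficients of Table 4.2 can be reproduced numerically. The following is a minimal sketch, not the text's own procedure: it assumes the orthonormal family (4.5.13), generates $P_j(x)$ from the standard three-term recurrence $(k+1)P_{k+1} = (2k+1)xP_k - kP_{k-1}$, and approximates the integral in (4.5.14) with a composite Simpson rule. The function names `legendre_values`, `simpson`, and `legendre_coefficients` are illustrative choices.

```python
import math

def legendre_values(n, x):
    """Return [P_0(x), ..., P_n(x)] from the three-term recurrence."""
    values = [1.0, x]
    for k in range(1, n):
        values.append(((2 * k + 1) * x * values[k] - k * values[k - 1]) / (k + 1))
    return values[: n + 1]

def simpson(g, a, b, panels=2000):
    """Composite Simpson rule on [a, b]; panels must be even."""
    h = (b - a) / panels
    total = g(a) + g(b)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

def legendre_coefficients(f, n):
    """Approximate (f, phi_j), j = 0..n, with phi_j = sqrt((2j+1)/2) * P_j."""
    coeffs = []
    for j in range(n + 1):
        integrand = lambda x, j=j: f(x) * legendre_values(j, x)[j]
        coeffs.append(math.sqrt((2 * j + 1) / 2.0) * simpson(integrand, -1.0, 1.0))
    return coeffs

# Coefficients for f(x) = e^x, to be compared with Table 4.2.
coeffs = legendre_coefficients(math.exp, 5)
for j, value in enumerate(coeffs):
    print(j, round(value, 6))
```

Simpson's rule is used here only for simplicity; any sufficiently accurate quadrature rule on [-1, 1] would serve equally well.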


Chebyshev polynomial expansions The weight function is $w(x) = 1/\sqrt{1 - x^2}$,
and

$$\varphi_0(x) = \frac{1}{\sqrt{\pi}}, \qquad \varphi_n(x) = \sqrt{\frac{2}{\pi}}\,T_n(x), \quad n \ge 1 \tag{4.5.15}$$

The least squares solution is

$$C_n(x) = \sum_{j=0}^{n} (f, \varphi_j)\,\varphi_j(x), \qquad (f, \varphi_j) = \int_{-1}^{1} \frac{f(x)\,\varphi_j(x)}{\sqrt{1 - x^2}}\,dx \tag{4.5.16}$$

Using the definition of $\varphi_n(x)$ in terms of $T_n(x)$,

$$C_n(x) = {\sum_{j=0}^{n}}' c_j T_j(x), \qquad c_j = \frac{2}{\pi}\int_{-1}^{1} \frac{f(x)\,T_j(x)}{\sqrt{1 - x^2}}\,dx \tag{4.5.17}$$

The prime on the summation symbol means that the first term should be halved
before beginning to sum.

The Chebyshev expansion is closely related to Fourier cosine expansions.
Using $x = \cos\theta$, $0 \le \theta \le \pi$,

$$C_n(\cos\theta) = {\sum_{j=0}^{n}}' c_j \cos(j\theta) \tag{4.5.18}$$

$$c_j = \frac{2}{\pi}\int_0^{\pi} \cos(j\theta)\,f(\cos\theta)\,d\theta \tag{4.5.19}$$

Thus $C_n(\cos\theta)$ is the truncation after n + 1 terms of the Fourier cosine expansion

$$f(\cos\theta) = {\sum_{j=0}^{\infty}}' c_j \cos(j\theta)$$

If the Fourier cosine expansion on $[0, \pi]$ of $f(\cos\theta)$ is known, then by substituting
$\theta = \cos^{-1} x$, we have the Chebyshev expansion of f(x).

For reasons to be given later, the Chebyshev least squares approximation is
more useful than the Legendre least squares approximation. For this reason, we
give a more detailed convergence theorem for (4.5.17).


Theorem 4.8 Let f(x) have r continuous derivatives on [-1, 1], with $r \ge 1$.
Then for the Chebyshev least squares approximation $C_n(x)$ defined
in (4.5.17),

$$\|f - C_n\|_\infty \le \frac{B \ln n}{n^r}, \qquad n \ge 2 \tag{4.5.20}$$

for a constant B dependent on f and r. Thus $C_n(x)$ converges
uniformly to f(x) as $n \to \infty$, provided f(x) is continuously
differentiable.


Proof Combine Rivlin (1974, theorem 3.3, p. 134) and Meinardus (1967,
theorem 45, p. 57).


Example We illustrate the Chebyshev expansion by again considering approximations
to $f(x) = e^x$. For the coefficients $c_j$ of (4.5.17),

$$c_j = \frac{2}{\pi}\int_{-1}^{1} \frac{e^x\,T_j(x)}{\sqrt{1 - x^2}}\,dx = \frac{2}{\pi}\int_0^{\pi} e^{\cos\theta}\cos(j\theta)\,d\theta \tag{4.5.21}$$

The latter formula is better for numerical integration, since there are no singularities
in the integrand. Either the midpoint rule or the trapezoidal rule is an
excellent integration method because of the periodicity of the integrand (see
Corollary 1 to Theorem 5.5 in Section 5.4). Using numerical integration we
obtain the values given in Table 4.3. Using (4.5.17) and the formulas for $T_j(x)$,
we obtain

$$C_1(x) = 1.266 + 1.130x$$

$$C_3(x) = .994571 + .997308x + .542991x^2 + .177347x^3$$

$$\|e^x - C_1\|_\infty = .32, \qquad \|e^x - C_3\|_\infty = .00607 \tag{4.5.22}$$


Table 4.3 Chebyshev expansion coefficients for $e^x$

j    c_j
0    2.53213176
1    1.13031821
2     .27149534
3     .04433685
4     .00547424
5     .00054293


[Figure 4.6 Error in cubic Chebyshev least squares approximation to $e^x$.]


The graph of $e^x - C_3(x)$ is given in Figure 4.6, and it is very similar to Figure
4.4, the graph for the minimax error. The maximum errors for these Chebyshev
least squares approximations are quite close to the minimax errors, and are
generally adequate for most practical purposes.


The polynomial $C_n(x)$ can be evaluated accurately and rapidly using the form
(4.5.17), rather than converting it to the ordinary form using the monomials $x^j$,
as was done previously for the example in (4.5.22). We make use of the triple
recursion relation (4.4.13) for the Chebyshev polynomials $T_n(x)$. The following
algorithm is due to C. W. Clenshaw, and our presentation follows Rivlin (1974,
p. 125).


Algorithm Chebeval (a, n, x, value)


1. Remark: This algorithm evaluates
   $$\text{value} := {\sum_{j=0}^{n}}' a_j T_j(x)$$
2. $b_{n+1} := b_{n+2} := 0$; $z := 2x$
3. Do through step 5 for j = n, n - 1, ..., 0.
4. $b_j := z \cdot b_{j+1} - b_{j+2} + a_j$
5. Next j.
6. value := $(b_0 - b_2)/2$.


This is almost as efficient as the nested multiplication algorithm (2.9.8) of 
Section 2.9. We leave the detailed comparison to Problem 25. Similar algorithms 
are available for other orthogonal polynomial expansions, again using their 
associated triple recursion relation. For an analysis of the effect of rounding 
errors in Chebeval, see Rivlin (1974, p. 127), and Fox and Parker (1968, p. 57). 
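As an illustration, the recurrence of algorithm Chebeval can be sketched in Python as follows. This is a minimal sketch rather than a definitive implementation; the function name `chebeval` and the list-based interface are choices made here. With the loop running down to j = 0 and the final formula $(b_0 - b_2)/2$, the value returned is the sum with its first term halved, $a_0/2 + a_1T_1(x) + \cdots + a_nT_n(x)$, which is the form (4.5.17).

```python
import math

def chebeval(a, x):
    """Evaluate a[0]/2 + sum_{j=1}^{n} a[j]*T_j(x) by Clenshaw's recurrence."""
    z = 2.0 * x
    b0 = b1 = b2 = 0.0  # hold b_{j+1}, b_{j+2}, b_{j+3} before each step
    for a_j in reversed(a):  # j = n, n-1, ..., 0
        # b_j := z*b_{j+1} - b_{j+2} + a_j, then shift the window down by one
        b0, b1, b2 = z * b0 - b1 + a_j, b0, b1
    return (b0 - b2) / 2.0  # b0, b2 now hold b_0 and b_2

# Cubic Chebyshev least squares approximation to e^x, coefficients from Table 4.3.
c = [2.53213176, 1.13031821, 0.27149534, 0.04433685]
for x in (-1.0, 0.0, 0.5, 1.0):
    print(x, chebeval(c, x), math.exp(x))
```

Only three scalars are kept, since each step of the recurrence needs just the two previously computed values.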


4.6 Minimax Approximations 


For a good uniform approximation to a given function f(x), it seems reasonable
to expect the error to be fairly uniformly distributed throughout the interval of
approximation. The earlier examples with $f(x) = e^x$ further illustrate this point,
and they show that the maximum error will oscillate in sign. Table 4.4 summarizes
the statistics for the various forms of approximation to $e^x$ on [-1, 1],
including some methods given in Section 4.7. To show the importance of a
uniform distribution of the error, with the error function oscillating in sign, we
present two theorems. The first one is useful for estimating $\rho_n(f)$, the minimax
error, without having to calculate the minimax approximation $q_n^*(x)$.


Theorem 4.9 (de la Vallée-Poussin) Let $f \in C[a, b]$ and $n \ge 0$. Suppose we
have a polynomial Q(x) of degree $\le n$ which satisfies

$$f(x_j) - Q(x_j) = (-1)^j e_j, \qquad j = 0, 1, \ldots, n + 1 \tag{4.6.1}$$

with all $e_j$ nonzero and of the same sign, and with

$$a \le x_0 < x_1 < \cdots < x_{n+1} \le b$$

Then

$$\min_{0 \le j \le n+1} |e_j| \le \rho_n(f) = \|f - q_n^*\|_\infty \le \|f - Q\|_\infty \tag{4.6.2}$$
Table 4.4 Comparison of various linear and cubic approximations to $e^x$

                                                      Maximum Error
Method of Approximation                               Linear    Cubic
Taylor polynomial $p_n(x)$                            .718      .0516
Legendre least squares $r_n^*(x)$                     .439      .0112
Chebyshev least squares $C_n(x)$                      .322      .00607
Chebyshev node interpolation formula $I_n(x)$         .372      .00666
Chebyshev forced oscillation formula $F_n(x)$         .286      .00558
Minimax $q_n^*(x)$                                    .279      .00553


Proof The upper inequality in (4.6.2) follows from the definition of $\rho_n(f)$. To
prove the lower bound, we assume it is false and produce a contradiction.
Assume

$$\rho_n(f) < \min_{0 \le j \le n+1} |e_j| \tag{4.6.3}$$

Then by the definition of $\rho_n(f)$, there is a polynomial P(x) of
degree $\le n$ for which

$$\rho_n(f) \le \|f - P\|_\infty < \min_{0 \le j \le n+1} |e_j| \tag{4.6.4}$$

Define

$$R(x) = Q(x) - P(x)$$

a polynomial of degree $\le n$. For simplicity, let all $e_j > 0$; an analogous
argument works when all $e_j < 0$.
Evaluate $R(x_j)$ for each j and observe the sign of $R(x_j)$. First,

$$R(x_0) = Q(x_0) - P(x_0) = [f(x_0) - P(x_0)] - [f(x_0) - Q(x_0)] = [f(x_0) - P(x_0)] - e_0 < 0$$

by using (4.6.4). Next,

$$R(x_1) = Q(x_1) - P(x_1) = [f(x_1) - P(x_1)] + e_1 > 0$$

Inductively, the sign of $R(x_j)$ is $(-1)^{j+1}$, j = 0, 1, ..., n + 1. Thus R(x)
alternates in sign at n + 2 points, which gives n + 1 changes of sign and implies
that R(x) has at least n + 1 zeros. Since degree(R) $\le n$, this is not possible
unless $R \equiv 0$. Then $P \equiv Q$, contrary to (4.6.1) and (4.6.4).


Example Recall the cubic Chebyshev least squares approximation $C_3(x)$ to $e^x$
on [-1, 1], given in (4.5.22). It has the maximum errors on the interval [-1, 1]
given in Table 4.5. These errors satisfy the hypotheses of Theorem 4.9, and
thus

$$.00497 \le \rho_3(f) \le .00607$$


Table 4.5 Relative maxima of $|e^x - C_3(x)|$

x           e^x - C_3(x)
-1.0         .00497
 -.6919     -.00511
  .0310      .00547
  .7229     -.00588
 1.0         .00607


From this, we could conclude that $C_3(x)$ was quite close to the best possible
approximation. Note that $\rho_3(f) = .00553$ from direct calculations using (4.2.7).
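The bracketing of $\rho_3(f)$ in this example can be checked directly. The sketch below is an illustration under stated assumptions: it uses the monomial form of $C_3(x)$ from (4.5.22), evaluates the error at the five points of Table 4.5 to confirm the sign alternation required by Theorem 4.9, and estimates $\|e^x - C_3\|_\infty$ by sampling on a uniform grid. The grid size and the variable names are arbitrary choices made here.

```python
import math

def c3(x):
    """Cubic Chebyshev least squares approximation to e^x, from (4.5.22)."""
    return 0.994571 + 0.997308 * x + 0.542991 * x**2 + 0.177347 * x**3

# Points of Table 4.5, at which the error has its relative maxima.
nodes = [-1.0, -0.6919, 0.0310, 0.7229, 1.0]
errors = [math.exp(x) - c3(x) for x in nodes]

# Hypothesis of Theorem 4.9: the errors alternate in sign (n + 2 = 5 points).
alternates = all(errors[j] * errors[j + 1] < 0 for j in range(len(errors) - 1))

# Lower bound min|e_j| and a sampled estimate of the upper bound ||f - C_3||.
lower = min(abs(e) for e in errors)
grid = [-1.0 + 2.0 * i / 2000 for i in range(2001)]
upper = max(abs(math.exp(x) - c3(x)) for x in grid)

print(alternates, round(lower, 5), round(upper, 5))
```

The sampled maximum only approximates the true norm from below, but for this smooth error function the largest value occurs at the endpoint x = 1, which is a grid point.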


Theorem 4.10 (Chebyshev Equioscillation Theorem) Let $f \in C[a, b]$ and $n \ge 0$.
Then there exists a unique polynomial $q_n^*(x)$ of degree $\le n$ for
which

$$\rho_n(f) = \|f - q_n^*\|_\infty$$

This polynomial is uniquely characterized by the following property:
There are at least n + 2 points,

$$a \le x_0 < x_1 < \cdots < x_{n+1} \le b$$

for which

$$f(x_j) - q_n^*(x_j) = \sigma(-1)^j \rho_n(f), \qquad j = 0, 1, \ldots, n + 1 \tag{4.6.5}$$

with $\sigma = \pm 1$, depending only on f and n.


Proof The proof is quite technical, amounting to a complicated and manipulatory
proof by contradiction. For that reason, it is omitted. For a
complete development, see Davis (1963, chap. 7).


Example The cubic minimax approximation $q_3^*(x)$ to $e^x$, given in (4.2.7),
satisfies the conclusions of this theorem, as can be seen from the graph of the
error in Figure 4.4 of Section 4.2.


From this theorem we can see that the Taylor series is always a poor uniform
approximation. The Taylor series error,

$$f(x) - p_n(x) = \frac{(x - x_0)^{n+1}}{(n + 1)!}\,f^{(n+1)}(\xi_x) \tag{4.6.6}$$

does not vary uniformly through the interval of approximation, nor does it
oscillate much in sign.

To give a better idea of how well $q_n^*(x)$ approximates f(x) with increasing n,
we give the following theorem of D. Jackson.

Theorem 4.11 (Jackson) Let f(x) have k continuous derivatives for some
$k \ge 0$. Moreover, assume $f^{(k)}(x)$ satisfies

$$|f^{(k)}(x) - f^{(k)}(y)| \le M|x - y|^{\alpha}, \qquad a \le x, y \le b \tag{4.6.7}$$

for some M > 0 and some $0 < \alpha \le 1$. [We say $f^{(k)}(x)$ satisfies a
Hölder condition with exponent $\alpha$.] Then there is a constant $d_k$,
independent of f and n, for which

$$\rho_n(f) \le \frac{M d_k}{n^{k+\alpha}}, \qquad n \ge 1 \tag{4.6.8}$$


Proof See Meinardus (1967, Theorem 45, p. 57). Note that if we wish to ignore
(4.6.7) with a k-times continuously differentiable function f(x), then just
use k - 1 in place of k in the theorem, with $\alpha = 1$ and $M = \|f^{(k)}\|_\infty$.
This will then yield

$$\rho_n(f) \le \frac{d_{k-1}}{n^k}\,\|f^{(k)}\|_\infty \tag{4.6.9}$$

Also note that if f(x) is infinitely differentiable, then $q_n^*(x)$ converges to
f(x) uniformly on [a, b], faster than any power $1/n^k$, $k \ge 1$.


From Theorem 4.12 in the next section, we are able to prove the following
result. If f(x) is n + 1 times continuously differentiable on [a, b], then

$$\rho_n(f) \le \frac{[(b - a)/2]^{n+1}}{(n + 1)!\,2^n}\,\|f^{(n+1)}\|_\infty \equiv M_n(f) \tag{4.6.10}$$

The proof is left as Problem 38. There are infinitely differentiable functions for
which $M_n(f) \to \infty$. However, for most of the widely used functions, the bound
in (4.6.10) seems to be a fairly accurate estimate of the magnitude of $\rho_n(f)$. This
estimate is illustrated in Table 4.6 for $f(x) = e^x$ on [-1, 1]. For other estimates
and bounds for $\rho_n(f)$, see Meinardus (1967, sec. 6.2).


Table 4.6 Comparison of $\rho_n$ and $M_n$ for $f(x) = e^x$

n           2          3          4          5          6          7
M_n(f)      1.13E-1    1.42E-2    1.42E-3    1.18E-4    8.43E-6    5.27E-7
ρ_n(f)      4.50E-2    5.53E-3    5.47E-4    4.52E-5    3.21E-6    2.00E-7
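The $M_n(f)$ row of Table 4.6 follows directly from (4.6.10). A minimal sketch, assuming $f(x) = e^x$ on [-1, 1], for which every derivative is $e^x$ and so $\|f^{(n+1)}\|_\infty = e$; the function name `minimax_bound` is a label chosen here:

```python
import math

def minimax_bound(n, a, b, max_deriv):
    """M_n(f) of (4.6.10); max_deriv is the maximum of |f^(n+1)| on [a, b]."""
    return ((b - a) / 2.0) ** (n + 1) / (math.factorial(n + 1) * 2.0 ** n) * max_deriv

# f(x) = e^x on [-1, 1]: the maximum of every derivative is e.
bounds = {n: minimax_bound(n, -1.0, 1.0, math.e) for n in range(2, 8)}
for n, value in bounds.items():
    print(f"n = {n}:  M_n(f) = {value:.2E}")
```

Comparing these values with the $\rho_n(f)$ row of the table shows the bound overestimating the minimax error by a factor of roughly 2.5 for each n listed.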


4.7 Near-Minimax Approximations 


In the light of the Chebyshev equioscillation theorem, we can deduce methods
that often give a good estimate of the minimax approximation. We begin with the
least squares approximation $C_n(x)$ of (4.5.17). It is often a good estimate of
$q_n^*(x)$, and our other near-minimax approximations are motivated by properties
of $C_n(x)$.

From (4.5.17),

$$C_n(x) = {\sum_{j=0}^{n}}' c_j T_j(x), \qquad c_j = \frac{2}{\pi}\int_{-1}^{1} \frac{f(x)\,T_j(x)}{\sqrt{1 - x^2}}\,dx \tag{4.7.1}$$

with the prime on the summation meaning that the first term (j = 0) should be
halved before summing the series. Using (4.5.7), if $f \in C[-1, 1]$, then

$$f(x) = {\sum_{j=0}^{\infty}}' c_j T_j(x) \tag{4.7.2}$$

with convergence holding in the sense that

$$\lim_{n \to \infty} \int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}}\left[f(x) - {\sum_{j=0}^{n}}' c_j T_j(x)\right]^2 dx = 0$$

For uniform convergence, we have the quite strong result that

$$\rho_n(f) \le \|f - C_n\|_\infty \le \left[4 + \frac{4}{\pi^2}\ln n\right]\rho_n(f) \tag{4.7.3}$$

For a proof, see Rivlin (1974, p. 134). Combining this with Jackson's result
(4.6.9) implies the earlier convergence bound (4.5.20) of Theorem 4.8.

If $f \in C^r[-1, 1]$, it can be proved that there is a constant c, dependent on f
and r, for which

$$|c_j| \le \frac{c}{j^r}, \qquad j \ge 1 \tag{4.7.4}$$

This is proved by considering the $c_j$ as the Fourier coefficients of $f(\cos\theta)$, and
then by using results from the theory of Fourier series on the rate of decrease of
such coefficients. Thus as r becomes larger, the coefficients $c_j$ decrease more
rapidly.

For the truncated expansion $C_n(x)$,

$$f(x) - C_n(x) = \sum_{j=n+1}^{\infty} c_j T_j(x) \approx c_{n+1}T_{n+1}(x) \tag{4.7.5}$$

if $c_{n+1} \ne 0$ and if the coefficients $c_j$ are rapidly convergent to zero. From the
definition of $T_{n+1}(x)$,

$$|T_{n+1}(x)| \le 1, \qquad -1 \le x \le 1 \tag{4.7.6}$$

Also, for the n + 2 points

$$x_j = \cos\left(\frac{j\pi}{n + 1}\right), \qquad j = 0, 1, \ldots, n + 1 \tag{4.7.7}$$

we have

$$T_{n+1}(x_j) = (-1)^j \tag{4.7.8}$$

The bound in (4.7.6) is attained at exactly n + 2 points, the maximum possible
number. Applying this to (4.7.5), the term $c_{n+1}T_{n+1}(x)$ has exactly n + 2 relative
maxima and minima, all of equal magnitude. From the Chebyshev equioscillation
theorem, we would therefore expect $C_n(x)$ to be nearly equal to the minimax
approximation $q_n^*(x)$.


Example Recall the example (4.5.22) for $f(x) = e^x$ near the end of Section 4.5.
In it, the coefficients $c_j$ decreased quite rapidly, and

$$\|e^x - C_3\|_\infty = .00607, \qquad c_4T_4(x) = .00547\,T_4(x), \qquad \|e^x - q_3^*\|_\infty = .00553$$

This illustrates the comments of the last paragraph.


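Example It can be shown that

$$\tan^{-1} x = 2\left[\alpha T_1(x) - \frac{\alpha^3}{3}T_3(x) + \frac{\alpha^5}{5}T_5(x) - \cdots\right] \tag{4.7.9}$$

converging uniformly for $-1 \le x \le 1$, with $\alpha = \sqrt{2} - 1 \doteq 0.414$. Then

$$C_{2n+1}(x) = 2\left[\alpha T_1(x) - \frac{\alpha^3}{3}T_3(x) + \cdots + (-1)^n\frac{\alpha^{2n+1}}{2n + 1}T_{2n+1}(x)\right] \tag{4.7.10}$$

For the error,

$$E_{2n+1}(x) \equiv \tan^{-1} x - C_{2n+1}(x) = 2(-1)^{n+1}\frac{\alpha^{2n+3}}{2n + 3}T_{2n+3}(x) + 2\sum_{j=n+2}^{\infty} (-1)^j\frac{\alpha^{2j+1}}{2j + 1}T_{2j+1}(x)$$

We bound these terms to estimate the error in $C_{2n+1}(x)$ and the minimax error
$\rho_{2n+1}(f)$. By taking upper bounds,

$$\left|2\sum_{j=n+2}^{\infty} (-1)^j\frac{\alpha^{2j+1}}{2j + 1}T_{2j+1}(x)\right| \le 2\sum_{j=n+2}^{\infty} \frac{\alpha^{2j+1}}{2j + 1} \le \frac{2}{2n + 5}\cdot\frac{\alpha^{2n+5}}{1 - \alpha^2}$$

Thus we obtain upper and lower bounds for $E_{2n+1}(x)$,

$$E_{2n+1}(x) \le \frac{2\alpha^{2n+3}}{2n + 3}(-1)^{n+1}T_{2n+3}(x) + \frac{2\alpha^{2n+5}}{(2n + 5)(1 - \alpha^2)} \le \frac{2\alpha^{2n+3}}{2n + 3}\left[(-1)^{n+1}T_{2n+3}(x) + .207\right]$$

using $\alpha^2/(1 - \alpha^2) = .207$. Similarly,

$$E_{2n+1}(x) \ge \frac{2\alpha^{2n+3}}{2n + 3}\left[(-1)^{n+1}T_{2n+3}(x) - .207\right]$$

Thus

$$E_{2n+1}(x) \approx 2(-1)^{n+1}\frac{\alpha^{2n+3}}{2n + 3}T_{2n+3}(x) \tag{4.7.11}$$

which has 2n + 4 relative maxima and minima, of alternating sign. By Theorem
4.9 and the preceding inequalities,

$$\frac{2(.793)\,\alpha^{2n+3}}{2n + 3} \le \rho_{2n+1}(f) = \rho_{2n+2}(f) \le \frac{2(1.207)\,\alpha^{2n+3}}{2n + 3} \tag{4.7.12}$$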
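The bounds of this example can be checked numerically. The sketch below is illustrative only: it assumes the expansion (4.7.9) with $\alpha = \sqrt{2} - 1$, evaluates $T_k(x)$ as $\cos(k \cos^{-1} x)$, and compares the sampled maximum of $|\tan^{-1} x - C_{2n+1}(x)|$ with the two sides of (4.7.12). Since $\rho_{2n+1}(f)$ cannot exceed the maximum error of $C_{2n+1}$, that maximum must also lie between the two bounds. The names `arctan_cheb`, `error_bounds`, and `max_error` are chosen for this sketch.

```python
import math

ALPHA = math.sqrt(2.0) - 1.0

def cheb_T(k, x):
    """Chebyshev polynomial T_k(x) for -1 <= x <= 1."""
    return math.cos(k * math.acos(x))

def arctan_cheb(n, x):
    """Truncated expansion C_{2n+1}(x) of (4.7.10)."""
    return 2.0 * sum(
        (-1) ** j * ALPHA ** (2 * j + 1) / (2 * j + 1) * cheb_T(2 * j + 1, x)
        for j in range(n + 1)
    )

def error_bounds(n):
    """Lower and upper bounds of (4.7.12)."""
    scale = 2.0 * ALPHA ** (2 * n + 3) / (2 * n + 3)
    return 0.793 * scale, 1.207 * scale

def max_error(n, samples=2001):
    """Sampled maximum of |arctan(x) - C_{2n+1}(x)| on [-1, 1]."""
    grid = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    return max(abs(math.atan(x) - arctan_cheb(n, x)) for x in grid)

for n in range(4):
    low, high = error_bounds(n)
    print(n, low, max_error(n), high)
```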


For the practical evaluation of the coefficients $c_j$ in (4.7.1), use the formula

$$c_j = \frac{2}{\pi}\int_0^{\pi} \cos(j\theta)\,f(\cos\theta)\,d\theta \tag{4.7.13}$$

and the midpoint or the trapezoidal numerical integration rule. This was illustrated
earlier in (4.5.21).
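A minimal sketch of this computation, assuming the midpoint rule with m equal subintervals of $[0, \pi]$ applied to (4.7.13); the function name `chebyshev_coefficients` and the default m = 64 are choices made for illustration. With step $\pi/m$ and midpoints $\theta_i = (i + \tfrac{1}{2})\pi/m$, the factor $2/\pi$ times the step reduces to $2/m$.

```python
import math

def chebyshev_coefficients(f, n, m=64):
    """Approximate c_0, ..., c_n of (4.7.13) with the m-point midpoint rule."""
    thetas = [(i + 0.5) * math.pi / m for i in range(m)]
    values = [f(math.cos(t)) for t in thetas]
    return [
        2.0 / m * sum(v * math.cos(j * t) for v, t in zip(values, thetas))
        for j in range(n + 1)
    ]

# Coefficients for f(x) = e^x, to be compared with Table 4.3.
coeffs = chebyshev_coefficients(math.exp, 5)
for j, c in enumerate(coeffs):
    print(j, f"{c:.8f}")
```

For a smooth function such as $e^x$ the integrand is smooth and periodic in $\theta$, so a modest m already reproduces the tabulated digits.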


Interpolation at the Chebyshev zeros If the error in the minimax approximation
is nearly $c_{n+1}T_{n+1}(x)$, as derived in (4.7.5), then the error should be nearly
zero at the roots of $T_{n+1}(x)$ on [-1, 1]. These are the points

$$x_j = \cos\left[\frac{(2j + 1)\pi}{2n + 2}\right], \qquad j = 0, 1, \ldots, n \tag{4.7.14}$$

Let $I_n(x)$ be the polynomial of degree $\le n$ that interpolates f(x) at these nodes
$\{x_j\}$. Since it has zero error at the Chebyshev zeros $x_j$ of $T_{n+1}(x)$, the continuity
of an interpolation polynomial with respect to the function values defining it
suggests that $I_n(x)$ should approximately equal $C_n(x)$, and hence also $q_n^*(x)$.
More precisely, write $I_n(x)$ in its Lagrange form and manipulate it as follows:

$$I_n(x) = \sum_{j=0}^{n} f(x_j)\,l_j(x) = \sum_{j=0}^{n} C_n(x_j)\,l_j(x) + \sum_{j=0}^{n} [f(x_j) - C_n(x_j)]\,l_j(x) = C_n(x) + \sum_{j=0}^{n} [f(x_j) - C_n(x_j)]\,l_j(x)$$

$$I_n(x) \approx C_n(x) \tag{4.7.15}$$

because $f(x_j) - C_n(x_j) \approx c_{n+1}T_{n+1}(x_j) = 0$.

The term $I_n(x)$ can be calculated using the algorithms Divdif and Interp of
Section 3.2. A note of caution: If $c_{n+1} = 0$, then the error $f(x) - q_n^*(x)$ is likely
to have as approximate zeros those of a higher degree Chebyshev polynomial,
usually $T_{n+2}(x)$. This is likely to happen if f(x) is either odd or even on [-1, 1]
(see Problem 29). [A function f(x) is even {odd} on an interval $[-a, a]$ if
f(-x) = f(x) {f(-x) = -f(x)} for all x in $[-a, a]$.] The preceding example
with $f(x) = \tan^{-1} x$ is an apt illustration.

A further motivation for considering $I_n(x)$ as a near-minimax approximation
is based on the following important theorem about Chebyshev polynomials.


Theorem 4.12 For a fixed integer n > 0, consider the minimization problem:

$$\tau_n = \inf_{\deg(Q) \le n-1}\left[\max_{-1 \le x \le 1} |x^n + Q(x)|\right] \tag{4.7.16}$$

with Q(x) a polynomial. The minimum $\tau_n$ is attained uniquely by
letting

$$x^n + Q(x) = \frac{1}{2^{n-1}}\,T_n(x) \tag{4.7.17}$$

defining Q(x) implicitly. The minimum is

$$\tau_n = \frac{1}{2^{n-1}} \tag{4.7.18}$$

Proof We begin by considering some facts about the Chebyshev polynomials.
From the definition, $T_0(x) = 1$, $T_1(x) = x$. The triple recursion relation

$$T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x)$$

is the basis of an induction proof that

$$T_n(x) = 2^{n-1}x^n + \text{lower degree terms}, \qquad n \ge 1 \tag{4.7.19}$$

Thus

$$\frac{1}{2^{n-1}}\,T_n(x) = x^n + \text{lower degree terms}, \qquad n \ge 1 \tag{4.7.20}$$

Since $T_n(x) = \cos(n\theta)$, $x = \cos\theta$, $0 \le \theta \le \pi$, the polynomial $T_n(x)$
attains relative maxima and minima at n + 1 points in [-1, 1]:

$$x_j = \cos\left(\frac{j\pi}{n}\right), \qquad j = 0, 1, \ldots, n \tag{4.7.21}$$

Additional values of j do not lead to new values of $x_j$ because of the
periodicity of the cosine function. For these points,

$$T_n(x_j) = (-1)^j, \qquad j = 0, 1, \ldots, n \tag{4.7.22}$$

and

$$-1 = x_n < x_{n-1} < \cdots < x_1 < x_0 = 1$$

The polynomial $T_n(x)/2^{n-1}$ has leading coefficient 1 and

$$\max_{-1 \le x \le 1}\left|\frac{1}{2^{n-1}}\,T_n(x)\right| = \frac{1}{2^{n-1}} \tag{4.7.23}$$

Thus $\tau_n \le 1/2^{n-1}$. Suppose that

$$\tau_n < \frac{1}{2^{n-1}} \tag{4.7.24}$$

We show that this leads to a contradiction. The assumption (4.7.24) and
the definition (4.7.16) imply the existence of a polynomial

$$M(x) = x^n + Q(x), \qquad \text{degree}(Q) \le n - 1$$

with

$$\tau_n \le \max_{-1 \le x \le 1} |M(x)| < \frac{1}{2^{n-1}} \tag{4.7.25}$$

Define

$$R(x) = \frac{1}{2^{n-1}}\,T_n(x) - M(x)$$

which has degree $\le n - 1$. We examine the sign of $R(x_j)$ at the points of
(4.7.21). Using (4.7.22) and (4.7.25),

$$R(x_0) = R(1) = \frac{1}{2^{n-1}} - M(1) > 0$$

$$R(x_1) = -\frac{1}{2^{n-1}} - M(x_1) = -\left[\frac{1}{2^{n-1}} + M(x_1)\right] < 0$$

and the sign of $R(x_j)$ is $(-1)^j$. Since R alternates in sign at the n + 1 points
$x_0, \ldots, x_n$, it has n changes of sign and must have at least n zeros. But then
degree(R) < n implies that $R \equiv 0$; thus $M \equiv (1/2^{n-1})T_n$, which contradicts (4.7.25).

To prove that no polynomial other than $(1/2^{n-1})T_n(x)$ will minimize
(4.7.16), a variation of the preceding proof is used. We omit it.


Consider now the problem of determining n + 1 nodes $x_j$ in [-1, 1], to be
used in constructing an interpolating polynomial $p_n(x)$ that is to approximate
the given function f(x) on [-1, 1]. The error in $p_n(x)$ is

$$f(x) - p_n(x) = \frac{(x - x_0)\cdots(x - x_n)}{(n + 1)!}\,f^{(n+1)}(\xi_x) \tag{4.7.26}$$

The value $f^{(n+1)}(\xi_x)$ depends on $\{x_j\}$, but the dependence is not one that can be
dealt with explicitly. Thus to try to make $\|f - p_n\|_\infty$ as small as possible, we
consider only the quantity

$$\max_{-1 \le x \le 1} |(x - x_0)\cdots(x - x_n)| \tag{4.7.27}$$

We choose $\{x_j\}$ to minimize this quantity. The polynomial in (4.7.27) is of
degree n + 1 and has leading coefficient 1. From the preceding theorem, (4.7.27)
is minimized by taking this polynomial to be $T_{n+1}(x)/2^n$, and the minimum
value of (4.7.27) is $1/2^n$. The nodes $\{x_j\}$ are the zeros of $T_{n+1}(x)$, and these are
given in (4.7.14). With this choice of nodes, $p_n = I_n$ and

$$\|f - I_n\|_\infty \le \frac{1}{(n + 1)!\,2^n}\,\|f^{(n+1)}\|_\infty \tag{4.7.28}$$


Example Let $f(x) = e^x$ and n = 3. We use the Newton divided difference
form of the interpolating polynomial. The nodes, function values, and needed
divided differences are given in Table 4.7. By direct computation,

$$\max_{-1 \le x \le 1} |e^x - I_3(x)| = .00666 \tag{4.7.29}$$

whereas the bound in (4.7.28) is .0142. A graph of $e^x - I_3(x)$ is shown in
Figure 4.7.


Table 4.7 Interpolation data for $f(x) = e^x$

i    x_i          f(x_i)       f[x_0, ..., x_i]
0     .923880     2.5190442    2.5190442
1     .382683     1.4662138    1.9453769
2    -.382683      .6820288     .7047420
3    -.923880      .3969760     .1751757


[Figure 4.7 $e^x - I_3(x)$.]


The error $\|f - I_n\|_\infty$ is generally not much worse than $\rho_n(f)$. A precise result
is that

$$\|f - I_n\|_\infty \le \left[\frac{2}{\pi}\log(n + 1) + 2\right]\rho_n(f), \qquad n \ge 0 \tag{4.7.30}$$

For a proof, see Rivlin (1974, p. 13). Actual numerical results are generally better
than that predicted by the bound in (4.7.30), as is illustrated by (4.7.29) in the
preceding example of $f(x) = e^x$.
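The interpolation example for $f(x) = e^x$ with n = 3 can be reproduced with a short script. This is a sketch under stated assumptions, not the text's Divdif and Interp algorithms of Section 3.2: it builds the nodes from (4.7.14), forms the Newton divided differences in place, evaluates the Newton form by nested multiplication, and samples the error on a uniform grid. All function names are illustrative.

```python
import math

def chebyshev_zeros(n):
    """Zeros of T_{n+1}(x), ordered as in (4.7.14)."""
    return [math.cos((2 * j + 1) * math.pi / (2 * n + 2)) for j in range(n + 1)]

def divided_differences(xs, fs):
    """Return [f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n]]."""
    coef = list(fs)
    for level in range(1, len(xs)):
        for i in range(len(xs) - 1, level - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - level])
    return coef

def newton_eval(xs, coef, x):
    """Evaluate the Newton form of the interpolating polynomial at x."""
    result = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[i]) + coef[i]
    return result

n = 3
xs = chebyshev_zeros(n)
coef = divided_differences(xs, [math.exp(x) for x in xs])

grid = [-1.0 + 2.0 * i / 2000 for i in range(2001)]
max_err = max(abs(math.exp(x) - newton_eval(xs, coef, x)) for x in grid)
bound = math.e / (math.factorial(n + 1) * 2 ** n)  # (4.7.28) with ||f^(4)|| = e

print(coef)
print(max_err, bound)
```

The divided differences printed first correspond to the last column of Table 4.7, and the two final numbers to (4.7.29) and the bound from (4.7.28).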


Forced oscillation of the error Let $f(x) \in C[-1, 1]$, and define

$$F_n(x) = {\sum_{k=0}^{n}}' c_{n,k}\,T_k(x), \qquad -1 \le x \le 1 \tag{4.7.31}$$

Let

$$-1 \le x_{n+1} < x_n < \cdots < x_1 < x_0 \le 1$$

be nodes, the choice of which is discussed below. Determine the coefficients $c_{n,k}$
by forcing the error $f(x) - F_n(x)$ to oscillate in the manner specified as necessary
by Theorem 4.10:

$$f(x_i) - F_n(x_i) = (-1)^i E_n, \qquad i = 0, 1, \ldots, n + 1 \tag{4.7.32}$$

We have introduced another unknown, $E_n$, which we hope will be nonzero. There
are n + 2 unknowns, the coefficients $c_{n,0}, \ldots, c_{n,n}$ and $E_n$, and there are n + 2
equations in (4.7.32). Thus there is a reasonable chance that a solution exists. If a
solution exists, then by Theorem 4.9,

$$|E_n| \le \rho_n(f) \le \|f - F_n\|_\infty \tag{4.7.33}$$

To pick the nodes, note that if $c_{n+1}T_{n+1}(x)$ in (4.7.5) is nearly the minimax
error, then the relative maxima and minima in the minimax error, $f(x) - q_n^*(x)$,
occur at the relative maxima and minima of $T_{n+1}(x)$. These maxima and minima
occur when $T_{n+1}(x) = \pm 1$, and they are given by

$$x_i = \cos\left(\frac{i\pi}{n + 1}\right), \qquad i = 0, 1, \ldots, n + 1 \tag{4.7.34}$$

These seem an excellent choice to use in (4.7.32) if $F_n(x)$ is to look like the
minimax approximation $q_n^*(x)$.

The system (4.7.32) becomes

$${\sum_{k=0}^{n}}' c_{n,k}\,T_k(x_i) + (-1)^i E_n = f(x_i), \qquad i = 0, 1, \ldots, n + 1 \tag{4.7.35}$$


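Note that

$$T_k(x_i) = \cos\left(k \cdot \cos^{-1}(x_i)\right) = \cos\left(\frac{ki\pi}{n + 1}\right)$$

Introduce $E_n = c_{n,n+1}/2$. The system (4.7.35) becomes

$${\sum_{k=0}^{n+1}}'' c_{n,k}\cos\left(\frac{ki\pi}{n + 1}\right) = f(x_i), \qquad i = 0, 1, \ldots, n + 1 \tag{4.7.36}$$

since

$$\cos\left(\frac{ki\pi}{n + 1}\right) = (-1)^i \qquad \text{for } k = n + 1$$

The notation $\sum''$ means that the first and last terms are to be halved before
summation begins.

To solve (4.7.36), we need the following relations:

$${\sum_{i=0}^{n+1}}'' \cos\left(\frac{ij\pi}{n + 1}\right)\cos\left(\frac{ik\pi}{n + 1}\right) = \begin{cases} n + 1, & j = k = 0 \text{ or } n + 1 \\ \dfrac{n + 1}{2}, & 0 < j = k < n + 1 \\ 0, & j \ne k, \quad 0 \le j, k \le n + 1 \end{cases} \tag{4.7.37}$$

The proof of these relations is closely related to the proof of the relations in
Problem 42 of Chapter 3.

Multiply equation i in (4.7.36) by $\cos[(ij\pi)/(n + 1)]$ for some $0 \le j \le n + 1$.
Then sum on i, halving the first and last terms. This gives

$${\sum_{k=0}^{n+1}}'' c_{n,k}\,{\sum_{i=0}^{n+1}}'' \cos\left(\frac{ij\pi}{n + 1}\right)\cos\left(\frac{ik\pi}{n + 1}\right) = {\sum_{i=0}^{n+1}}'' f(x_i)\cos\left(\frac{ij\pi}{n + 1}\right)$$

Using the relations (4.7.37), all but one of the terms in the summation on k will
be zero. By checking the two cases j = 0, n + 1, and 0 < j < n + 1, the same
formula is obtained:

$$c_{n,j} = \frac{2}{n + 1}\,{\sum_{i=0}^{n+1}}'' f(x_i)\cos\left(\frac{ij\pi}{n + 1}\right), \qquad 0 \le j \le n + 1 \tag{4.7.38}$$

The formula for $E_n$ is

$$E_n = \frac{1}{n + 1}\,{\sum_{i=0}^{n+1}}'' (-1)^i f(x_i) \tag{4.7.39}$$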

There are a number of connections between this approximation $F_n(x)$ and the
Chebyshev expansion $C_n(x)$. Most importantly, the coefficients $c_{n,j}$ are approximations
to the coefficients $c_j$ in $C_n(x)$. Evaluate formula (4.7.13) for $c_j$ by
using the trapezoidal rule (5.1.5) with n + 1 subdivisions:

$$c_j = \frac{2}{\pi}\int_0^{\pi} f(\cos\theta)\cos(j\theta)\,d\theta \approx \frac{2}{n + 1}\,{\sum_{i=0}^{n+1}}'' f\left(\cos\frac{i\pi}{n + 1}\right)\cos\left(\frac{ji\pi}{n + 1}\right) = c_{n,j} \tag{4.7.40}$$

It is well known that for periodic integrands, the trapezoidal numerical integration
rule is especially accurate (see Theorem 5.5 in Section 5.4). In addition, it
can be shown that

$$c_{n,j} = c_j + c_{2(n+1)-j} + c_{2(n+1)+j} + c_{4(n+1)-j} + c_{4(n+1)+j} + \cdots \tag{4.7.41}$$

If the Chebyshev coefficients $c_j$ in (4.7.2) decrease rapidly, then the approximation
$F_n(x)$ nearly equals $C_n(x)$, and it is easier to calculate than $C_n(x)$.
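As a quick arithmetic check of (4.7.41), take $f(x) = e^x$ and n = 3, with $c_j$ from Table 4.3 and $c_{3,j}$ from Table 4.10. The worked lines below use only those tabulated values; terms beyond the first correction are assumed negligible at the eight decimals shown.

```latex
% n = 3, so 2(n+1) = 8 and (4.7.41) reads c_{3,j} = c_j + c_{8-j} + c_{8+j} + ...
\begin{align*}
c_{3,3} &\approx c_3 + c_5 = .04433685 + .00054293 = .04487978 \\
E_3 = \tfrac{1}{2}c_{3,4} &\approx \tfrac{1}{2}(c_4 + c_4) = c_4 = .00547424
\end{align*}
```

Both results agree with the entries of Table 4.10, which suggests that for this function the discrepancy between $c_{n,j}$ and $c_j$ is dominated by the single coefficient $c_{2(n+1)-j}$.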


Example Use $f(x) = e^x$, $-1 \le x \le 1$, as before. For n = 1, the nodes are

$$\{x_i\} = \{-1, 0, 1\}$$

$E_1 = .272$, and

$$F_1(x) = 1.2715 + 1.1752x \tag{4.7.42}$$

For the error, the relative maxima for $e^x - F_1(x)$ are given in Table 4.8. For
n = 3, $E_3 = .00547$ and

$$F_3(x) = .994526 + .995682x + .543081x^2 + .179519x^3 \tag{4.7.43}$$

The points of maximum error are given in Table 4.9. From Theorem 4.9

$$.00547 \le \rho_3(f) \le .00558$$

and this says that $F_3(x)$ is an excellent approximation to $q_3^*(x)$.
To see that $F_3(x)$ is an approximation to $C_3(x)$, compare the coefficients $c_{3,j}$
with the coefficients $c_j$ given in Table 4.3 in example (4.5.22) at the end of
Section 4.5. The results are given in Table 4.10.


Table 4.8 Relative maxima of $|e^x - F_1(x)|$

x           e^x - F_1(x)
-1.0         .272
  .1614     -.286
 1.0         .272


Table 4.9 Relative maxima of $|e^x - F_3(x)|$

x           e^x - F_3(x)
-1.0         .00547
 -.6832     -.00552
  .0493      .00558
  .7324     -.00554
 1.0         .00547


Table 4.10 Expansion coefficients for $C_3(x)$ and $F_3(x)$ to $e^x$

j    c_j           c_{3,j}
0    2.53213176    2.53213215
1    1.13031821    1.13032142
2     .27149534     .27154032
3     .04433685     .04487978
4     .00547424    E_3 = .00547424


As with the interpolatory approximation $I_n(x)$, care must be taken if f(x) is
odd or even on [-1, 1]. In such a case, choose n as follows:

If f is even, then choose n to be odd; if f is odd, then choose n to be even. (4.7.44)

This ensures $c_{n+1} \ne 0$ in (4.7.5), and thus the nodes chosen will be the correct
ones.

An analysis of the convergence of $F_n(x)$ to f(x) is given in Shampine (1970),
resulting in a bound similar to the bound (4.7.30) for $I_n(x)$:

$$\|f - F_n\|_\infty \le w(n)\,\rho_n(f) \tag{4.7.45}$$

with w(n) empirically nearly equal to the bounding coefficient in (4.7.3). Both
$I_n(x)$ and $F_n(x)$ are practical near-minimax approximations.


We now give an algorithm for computing $F_n(x)$, which can then be evaluated
using the algorithm Chebeval of Section 4.5.


Algorithm Approx (c, E, f, n)


1. Remark: This algorithm calculates the coefficients $c_j$ in
   $$F_n(x) = {\sum_{j=0}^{n}}' c_j T_j(x), \qquad -1 \le x \le 1$$
   according to the formula (4.7.38), and E is calculated from
   (4.7.39). The prime indicates that the term $c_0$ is halved in this sum. The
   final formula value := $(b_0 - b_2)/2$ of algorithm Chebeval already performs
   this halving, so $c_0, \ldots, c_n$ are passed to Chebeval unchanged.
2. Create $x_i := \cos(i\pi/(n + 1))$, $f_i := f(x_i)$, i = 0, 1, ..., n + 1
3. Do through step 8 for j = 0, 1, ..., n + 1.
4. sum := $[f_0 + (-1)^j f_{n+1}]/2$
5. Do through step 6 for i = 1, ..., n.
6. sum := sum + $f_i \cos(ij\pi/(n + 1))$
7. End loop on i.
8. $c_j$ := 2 sum/(n + 1)
9. End loop on j.
10. E := $c_{n+1}/2$ and exit.

The cosines in step 6 can be evaluated more efficiently by using the trigono- 
metric addition formulas for the sine and cosine functions, but we have opted for 
pedagogical simplicity, since the computer running time will still be quite small 
with our algorithm. For the same reason, we have not used the FFT techniques of 
Section 3.8. 
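The steps of algorithm Approx translate almost line for line into Python. The following is a minimal sketch, with the function name `approx` and the return convention (a list of coefficients together with E) chosen here for illustration. As in the algorithm, the cosines are evaluated directly for simplicity rather than by addition formulas or the FFT.

```python
import math

def approx(f, n):
    """Return (c, E): c[j] = c_{n,j} of (4.7.38) for j = 0..n+1, and E = c[n+1]/2."""
    m = n + 1
    fx = [f(math.cos(i * math.pi / m)) for i in range(m + 1)]   # step 2
    c = []
    for j in range(m + 1):                                       # steps 3-9
        total = (fx[0] + (-1) ** j * fx[m]) / 2.0                # halved end terms
        for i in range(1, m):
            total += fx[i] * math.cos(i * j * math.pi / m)
        c.append(2.0 * total / m)
    return c, c[m] / 2.0                                         # step 10

c, E = approx(math.exp, 3)
print([f"{value:.8f}" for value in c[:4]], f"{E:.8f}")

# At the node x_0 = 1 every T_k equals 1, so F_3(1) = c_0/2 + c_1 + c_2 + c_3
# and, by (4.7.32), e - F_3(1) should equal E_3.
f3_at_1 = c[0] / 2.0 + c[1] + c[2] + c[3]
print(math.e - f3_at_1, E)
```

For $f(x) = e^x$ and n = 3 the printed coefficients can be compared with the $c_{3,j}$ column of Table 4.10.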


Discussion of the Literature 


Approximation theory is a classically important area of mathematics, and it is 
also an increasingly important tool in studying a wide variety of modern 
problems in applied mathematics, for example, in mathematical physics and 
combinatorics. The variety of problems and approaches in approximation theory 
can be seen in the books of Achieser (1956), Askey (1975b), Davis (1963), 
Lorentz (1966), Meinardus (1967), Powell (1981), and Rice (1964), (1968). The 
classic work on orthogonal polynomials is Szego (1967), and a survey of more 
recent work is given in Askey (1975a). For Chebyshev polynomials and their 
many uses throughout applied mathematics and numerical analysis, see Fox and 
Parker (1968) and Rivlin (1974). The related subject of Fourier series and 
approximation by trigonometric polynomials was only alluded to in the text, but 
it is of central importance in a large number of applications. The classical 
reference is Zygmund (1959). There are many other areas of approximation 
theory that we have not even defined. For an excellent survey of these areas, 
including an excellent bibliography, see Gautschi (1975). A major area omitted in 
the present text is approximation by rational functions. For the general area, see 
Meinardus (1967, chap. 9) and Rice (1968, chap. 9). The generalization to
rational functions of the Taylor polynomial is called the Padé approximation; for
introductions, see Baker (1975) and Brezinski (1980). For the related area of
continued fraction expansions of functions, see Wall (1948). Many of the functions 
that are of practical interest are examples of what are called the special functions 
of mathematical physics. These include the basic transcendental functions (sine, 
log, exp, square root), and in addition, orthogonal polynomials, the Bessel 
functions, gamma function, and hypergeometric function. There is an extensive 
literature on special functions, and special approximations have been devised for 
most of them. The most important references for special functions are in 
Abramowitz and Stegun (1964), a handbook produced under the auspices of the 
U.S. National Bureau of Standards, and Erdélyi et al. (1953), a three volume set 
that is often referred to as the “Bateman project.” For a general overview and 
survey of the techniques for approximating special functions, see Gautschi 
(1975). An extensive compendium of theoretical results for special functions and 
of methods for their numerical evaluation is given in Luke (1969), (1975), (1977). 
For a somewhat more current sampling of trends in the study of special 
functions, see the symposium proceedings in Askey (1975b). 

From the advent of large-scale use of computers in the 1950s, there has been a 
need for high-quality polynomial or rational function approximations of the basic 
mathematical functions and other special functions. As pointed out previously, 
the approximation of these functions requires a knowledge of their properties. 
But it also requires an intimate knowledge of the arithmetic of digital computers,
as surveyed in Chapter 1. A general survey of numerical methods for producing 
polynomial approximations is given in Fraser (1965), which has influenced the 
organization of this chapter. For a very complete discussion of approximation of
the elementary functions, together with detailed algorithms, see Cody and Waite
(1980); the associated programming project is discussed in Cody
(1984). For a similar presentation of approximations, but one that also includes
some of the more common special functions, see Hart et al. (1968). For an 
extensive set of approximations for special functions, see Luke (1975), (1977). 
For general functions, a successful and widely used program for generating 
minimax approximations is given in Cody et al. (1968). General programs for 
computing minimax approximations are available in the IMSL and NAG libraries.


Problems


1. To illustrate that the Bernstein polynomials $p_n(x)$ in Theorem 4.1 are poor
approximations, calculate the fourth-degree approximation $p_4(x)$ for
$f(x) = \sin(\pi x)$, $0 \le x \le 1$. Compare it with the fourth-degree Taylor polynomial
approximation, expanded about x = 1/2.


2. Let $S = \sum_{j=1}^{\infty} (-1)^j a_j$ be a convergent series, and assume that all $a_j \ge 0$ and

$$a_1 \ge a_2 \ge a_3 \ge \cdots \ge a_n \ge \cdots$$

Prove that

$$\left|S - \sum_{j=1}^{n} (-1)^j a_j\right| \le a_{n+1}$$


3. Using Problem 2, examine the convergence of the following series. Bound
the error when truncating after n terms, and note the dependence on x.
Find the value of n for which the error is less than $10^{-5}$. This problem
illustrates another common technique for bounding the error in Taylor
series.
= (=1)/(4x?)! 
(a) J, = Sage 
ai Z (yt) 
co -] J 2j 
oo 
j=l J 


4. Graph the errors of the Taylor series approximations $p_n(x)$ to
$f(x) = \sin[(\pi/2)x]$ on $-1 \le x \le 1$, for n = 1, 3, 5. Note the behavior of the error
both near the origin and near the endpoints.


10. 


11. 


12. 


13.
Obtain a formula for a polynomial q(x) approximating f(x), with the
formula for q(x) involving a single integral.


(c) Assume that f(x) is infinitely differentiable on [a, b], that is, $f^{(j)}(x)$
exists and is continuous on [a, b], for all $j \ge 0$. [This does not imply
that f(x) has a convergent Taylor series on [a, b].] Prove there exists
a sequence of polynomials $\{p_n(x) \mid n \ge 1\}$ for which

$$\lim_{n \to \infty} \|f^{(j)} - p_n^{(j)}\|_\infty = 0$$

for all $j \ge 0$. Hint: Use the Weierstrass theorem and part (b).


Prove the following result. Let $f \in C^2[a, b]$ with $f''(x) > 0$ for $a \le x \le b$.
If $q_1^*(x) = a_0 + a_1x$ is the linear minimax approximation to f(x) on
[a, b], then

$$a_1 = \frac{f(b) - f(a)}{b - a}, \qquad a_0 = \frac{f(a) + f(c)}{2} - \left(\frac{a + c}{2}\right)\frac{f(b) - f(a)}{b - a}$$

where c is the unique solution of

$$f'(c) = \frac{f(b) - f(a)}{b - a}$$

What is $\rho_1(f)$?


(a) Produce the linear Taylor polynomial to f(x) = ln(x) on $1 \le x \le 2$,
expanding about $x_0 = 3/2$. Graph the error.


(b) Produce the linear minimax approximation to f(x) = In(x) on [1, 2]. 
Graph the error, and compare it with the Taylor approximation. 


(a) Show that the linear minimax approximation to $\sqrt{1 + x^2}$ on [0, 1] is

$$q_1^*(x) = .955 + .414x$$


(b) Using part (a), derive the approximation

$$\sqrt{y^2 + z^2} \approx .955z + .414y, \qquad 0 \le y \le z$$

and determine the error.


Find the linear least squares approximation to f(x) = In(x) on [1,2]. 
Compare the error with the results of Problem 10. 


Find the value of $\alpha$ that minimizes

$$\int_0^1 |e^x - \alpha|\,dx$$

14.


15. 


16. 


17. 


18. 


19. 


20. 


What is the minimum? This is a simple illustration of yet another way to 
measure the error of an approximation and of the resulting best approxima- 
tion. 


Solve the following minimization problems and determine whether there is
a unique value of $\alpha$ that gives the minimum. In each case, $\alpha$ is allowed to
range over all real numbers. We are approximating the function f(x) = x
with polynomials of the form $\alpha x^2$.

(a) $\min_{\alpha} \int_{-1}^{1} [x - \alpha x^2]^2\,dx$

(b) $\min_{\alpha} \int_{-1}^{1} |x - \alpha x^2|\,dx$

(c) $\min_{\alpha} \max_{-1 \le x \le 1} |x - \alpha x^2|$
Using (4.4.10), show that $\{P_n(x)\}$ is an orthogonal family and that

$$ \|P_n\|_2 = \sqrt{2/(2n + 1)}, \qquad n \ge 0 $$


Verify that 
(-1)" _@” 


eo gees 


@, (x) = 


for $n \ge 0$ are orthogonal on the interval $[0, \infty)$ with respect to the weight
function $w(x) = e^{-x}$. (Note: $\int_0^\infty e^{-x} x^m\,dx = m!$ for $m = 0, 1, 2, \dots$)


(a) Find the relative maxima and minima of T,(x) on [—1,1], obtaining 
(4.7.21). 


(b) Find the zeros of T,,(x), obtaining (4.7.14). 


Derive the formulas for b, and c, given in the triple recursion relation in 
(4.4.21) of Theorem 4.5. 


Modify the Gram-Schmidt procedure of Theorem 4.2, to avoid the normalization
step $\varphi_n = \psi_n / \|\psi_n\|_2$:

$$ \psi_n(x) = x^n + b_{n,n-1}\psi_{n-1}(x) + \cdots + b_{n,0}\psi_0(x) $$

and find the coefficients $b_{n,j}$, $0 \le j \le n - 1$.


Using Problem 19, find $\psi_0$, $\psi_1$, $\psi_2$ for the following weight functions w(x)
on the indicated intervals [a, b].


(a) w(x)=In(x), O<x<1 


1 , 
Hint OU x"méy\de —f—-1/in +171 on > 0 
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5. 


Let f(x) be three times continuously differentiable on $[-\alpha, \alpha]$ for some
$\alpha > 0$, and consider approximating it by the rational function

$$ R(x) = \frac{a + bx}{1 + cx} $$

To generalize the idea of the Taylor series, choose the constants a, b, and c
so that

$$ R^{(j)}(0) = f^{(j)}(0), \qquad j = 0, 1, 2 $$

Is it always possible to find such an approximation R(x)? The function
R(x) is an example of a Padé approximation to f(x). See Baker (1975) and
Brezinski (1980).


Apply the results of Problem 5 to the case $f(x) = e^x$, and give the resulting
approximation R(x). Analyze its error on [-1, 1], and compare it with the
error for the quadratic Taylor polynomial.


By means of various identities, it is often possible to reduce the interval on
which a function needs to be approximated. Show how to reduce each of
the following functions from $-\infty < x < \infty$ to the given interval. Usually a
few additional, but simple, operations will be needed.

(a) $e^x$, $0 \le x \le 1$
(b) $\cos(x)$, $0 \le x \le \pi/4$
(c) $\tan^{-1}(x)$, $0 \le x \le 1$
(d) $\ln(x)$, $1 \le x \le 2$. Reduce from $\ln(x)$ on $0 < x < \infty$.


(a) Let f(x) be continuously differentiable on [a, b]. Let p(x) be a
polynomial for which

$$ \|f' - p\|_\infty \le \epsilon $$

and define

$$ q(x) = f(a) + \int_a^x p(t)\,dt, \qquad a \le x \le b $$

Show that q(x) is a polynomial and satisfies

$$ \|f - q\|_\infty \le \epsilon(b - a) $$

(b) Extend part (a) to the case where f(x) is N times continuously
differentiable on [a, b], $N \ge 2$, and p(x) is a polynomial satisfying

$$ \|f^{(N)} - p\|_\infty \le \epsilon $$


21. 


22. 
(b) $w(x) = x$, $0 \le x \le 1$


(c) $w(x) = \sqrt{1 - x^2}$, $-1 \le x \le 1$


Let $\{\varphi_n(x) \mid n \ge 1\}$ be orthogonal on (a, b) with weight function $w(x) \ge 0$.
Denote the zeros of $\varphi_n(x)$ by

$$ a < z_{n,n} < z_{n-1,n} < \cdots < z_{1,n} < b $$

Prove that the zeros of $\varphi_{n+1}(x)$ are separated by those of $\varphi_n(x)$, that is,

$$ a < z_{n+1,n+1} < z_{n,n} < z_{n,n+1} < \cdots < z_{2,n+1} < z_{1,n} < z_{1,n+1} < b $$

Hint: Use induction on the degree n. Write $\varphi_n(x) = A_n x^n + \cdots$, with
$A_n > 0$, and use the triple recursion relation (4.4.21) to evaluate the
polynomials at the zeros of $\varphi_n(x)$. Observe that the sign changes for
$\varphi_{n-1}(x)$ and $\varphi_{n+1}(x)$.


Extend the Christoffel-Darboux identity (4.4.27) to the case with x = y,
obtaining a formula for

$$ \sum_{k=0}^{n} \frac{[\varphi_k(x)]^2}{\gamma_k} $$

Hint: Consider the limit in (4.4.27) as $y \to x$.


ran 


Let $f(x) = \cos^{-1}(x)$ for $-1 \le x \le 1$ (the principal branch $0 \le f \le \pi$).
Find the polynomial of degree two,

$$ p(x) = a_0 + a_1 x + a_2 x^2 $$

which minimizes

$$ \int_{-1}^{1} \frac{[f(x) - p(x)]^2}{\sqrt{1 - x^2}}\,dx $$


Define $S_n(x) = \frac{1}{n+1} T'_{n+1}(x)$, $n \ge 0$, with $T_{n+1}(x)$ the Chebyshev
polynomial of degree n + 1. The polynomials $S_n(x)$ are called Chebyshev
polynomials of the second kind.


(a) Show that $\{S_n(x) \mid n \ge 0\}$ is an orthogonal family on [-1, 1] with
respect to the weight function $w(x) = \sqrt{1 - x^2}$.


(b) Show that the family $\{S_n(x)\}$ satisfies the same triple recursion
relation (4.4.13) as the family $\{T_n(x)\}$.
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25. 


26. 


27. 


29. 


31. 


(c) Given $f \in C[-1, 1]$, solve the problem

$$ \min \int_{-1}^{1} \sqrt{1 - x^2}\,[f(x) - p_n(x)]^2\,dx $$

where $p_n(x)$ is allowed to range over all polynomials of degree $\le n$.


Do an operations count for algorithm Chebeval of Section 4.5. Give the 
number of multiplications and the number of additions. Compare this to 
the ordinary nested multiplication algorithm. 


Show that the framework of Sections 4.4 and 4.5 also applies to the
trigonometric polynomials of degree $\le n$. Show that the family
$\{1, \sin(x), \cos(x), \dots, \sin(nx), \cos(nx)\}$ is orthogonal on $[0, 2\pi]$. Derive
the least squares approximation to f(x) on $[0, 2\pi]$ using such polynomials.
[Letting $n \to \infty$, we obtain the well-known Fourier series (see Zygmund
(1959)).]


Let f(x) be a continuous even (odd) function on [-a, a]. Show that the
minimax approximation $q_n^*(x)$ to f(x) will be an even (odd) function on
[-a, a], regardless of whether n is even or odd. Hint: Use Theorem 4.10,
including the uniqueness result.


Using (4.6.10), bound $\rho_n(f)$ for the following functions f(x) on the given
interval, n = 1, 2, ..., 10.

(a) $\sin(x)$, $0 \le x \le \pi/2$
(b) $\ln(x)$, $1 \le x \le e$
(c) $\tan^{-1}(x)$, $0 \le x \le \pi/4$
(d) $e^x$, $0 \le x \le 1$


For the Chebyshev expansion (4.7.2), show that if f(x) is even (odd) on
[-1, 1], then $c_j = 0$ if j is odd (even).


For $f(x) = \sin[(\pi/2)x]$, $-1 \le x \le 1$, find both the Legendre and
Chebyshev least squares approximations of degree three to f(x). Determine
the error in each approximation and graph them. Use Theorem 4.9 to
bound the minimax error $\rho_3(f)$. Hint: Use numerical integration to simplify
constructing the least squares coefficients. Note the comments following
(4.7.13), for the Chebyshev least squares approximation.


Produce the interpolatory near-minimax approximation $I_n(x)$ for the following
basic mathematical functions f on the indicated intervals, for
n = 1, 2, ..., 8. Using the standard routines of your computer, compute the
error. Graph the error, and using Theorem 4.9, give upper and lower
bounds for $\rho_n(f)$.

(a) $e^x$, $0 \le x \le 1$          (b) $\sin(x)$, $0 \le x \le \pi/2$

(c) $\tan^{-1}(x)$, $0 \le x \le 1$   (d) $\ln(x)$, $1 \le x \le 2$


32. 


33. 


35. 


37. 


38. 


39. 
Repeat Problem 31 with the near-minimax approximation $F_n(x)$.
Repeat Problems 31 and 32 for


x sin (t) eae 


] 
flx) = = [at > 


Hint: Find a Taylor approximation of high degree for evaluating f(x), then 
use the transformation (4.5.12) to obtain an approximation problem on 
[—-1, 1]. 


For $f(x) = e^x$ on [-1, 1], consider constructing $I_n(x)$. Derive the error
result

$$ \alpha_n |T_{n+1}(x)| \le |f(x) - I_n(x)| \le \beta_n |T_{n+1}(x)|, \qquad -1 \le x \le 1 $$

for appropriate constants $\alpha_n$, $\beta_n$. Find nonzero upper and lower bounds for
$\rho_n(f)$.


(a) The function sin(x) vanishes at x = 0. In order to better approximate
it in a relative error sense, we consider the function $f(x) = \sin(x)/x$.
Calculate the near-minimax approximation $I_n(x)$ on $0 \le x \le \pi/2$ for
n = 1, 2, ..., 7, and then compare $\sin(x) - x I_n(x)$ with the results of
Problem 31(b).


(b) Repeat part (a) for $f(x) = \tan^{-1}(x)$, $0 \le x \le 1$.


Let $f(x) = a_n x^n + \cdots + a_1 x + a_0$, $a_n \ne 0$. Find the minimax approximation
to f(x) on [-1, 1] by a polynomial of degree $\le n - 1$, and also find
$\rho_{n-1}(f)$.


Let $\alpha = \min \left[ \max_{|x| \le 1} |x^6 - x^3 - p_5(x)| \right]$, where the minimum is taken over
all polynomials of degree $\le 5$.
(a) Find $\alpha$.
(b) Find the polynomial $p_5(x)$ for which the minimum $\alpha$ is attained.


Prove the result (4.6.10). Hint: Consider the near-minimax approximation 
T(x). 


(a) For $f(x) = e^x$ on [-1, 1], find the Taylor polynomial $p_4(x)$ of degree
four, expanded about x = 0.


(b) Using Problem 36, find the minimax polynomial $m_{3,4}(x)$ of degree
three that approximates $p_4(x)$. Graph the error $e^x - m_{3,4}(x)$ and
compare it to the Taylor error $e^x - p_3(x)$ shown in Figure 4.1. The
process of reducing a Taylor polynomial to a lower degree one by this
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40. 


41. 


42. 


process is called economization or telescoping. It is usually used
several times in succession, to reduce a high-degree Taylor polynomial
to a polynomial approximation of much lower degree.
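
As a rough illustration of one economization step, the sketch below lowers the degree-four Taylor polynomial of $e^x$ to degree three by subtracting the multiple of $T_4(x) = 8x^4 - 8x^2 + 1$ that cancels the $x^4$ term. The helper `polyval` and the 1001-point sampling grid are illustrative assumptions, not part of the text; the sampled maxima only approximate the true maximum errors on [-1, 1].

```python
import math

# Coefficients in increasing powers of x.
p4 = [1.0, 1.0, 1.0 / 2.0, 1.0 / 6.0, 1.0 / 24.0]   # Taylor polynomial of e^x, degree 4
T4 = [1.0, 0.0, -8.0, 0.0, 8.0]                      # Chebyshev polynomial T_4(x)

# Subtract (a_4 / 8) * T_4 so that the x^4 term cancels exactly.
lead = p4[4] / T4[4]
m3 = [p4[i] - lead * T4[i] for i in range(4)]

def polyval(coeffs, x):
    """Evaluate a polynomial by nested multiplication (Horner's rule)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# Sample the errors on [-1, 1] (hypothetical grid of 1001 points).
xs = [-1.0 + 2.0 * i / 1000 for i in range(1001)]
err_econ = max(abs(math.exp(x) - polyval(m3, x)) for x in xs)
err_taylor3 = max(abs(math.exp(x) - polyval(p4[:4], x)) for x in xs)

print("economized cubic coefficients:", m3)
print("max error, economized cubic  :", err_econ)
print("max error, cubic Taylor      :", err_taylor3)
```

Under these assumptions the economized cubic has a noticeably smaller sampled maximum error than the cubic Taylor polynomial, at the cost of being slightly less accurate near the origin.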
Using a standard program from your computer center for computing
minimax approximations, calculate the minimax approximation $q_n^*(x)$ for
the following given functions f(x) on the given interval. Do this for
n = 1, 2, 3, ..., 8. Compare the results with those for Problem 31.
(a) $e^x$, $0 \le x \le 1$          (b) $\sin(x)$, $0 \le x \le \pi/2$
(c) $\tan^{-1}(x)$, $0 \le x \le 1$   (d) $\ln(x)$, $1 \le x \le 2$


Produce the minimax approximations q*(x), n = 1,3,5,7,9, for 
f= feta ~l<x<l 
: <x< 


Hint: First produce a Taylor approximation of high accuracy, and then use 
it with the program of Problem 40. 


Repeat Problem 41 for 


FIVE 


NUMERICAL INTEGRATION 


In this chapter we derive and analyze numerical methods for evaluating definite 
integrals. The integrals are mainly of the form 


$$ I(f) = \int_a^b f(x)\,dx \qquad (5.0.1) $$


with [a, b] finite. Most such integrals cannot be evaluated explicitly, and with 
many others, it is often faster to integrate them numerically rather than evaluat- 
ing them exactly using a complicated antiderivative of f(x). The approximation 
of I(f) is usually referred to as numerical integration or quadrature. 

There are many numerical methods for evaluating (5.0.1), but most can be 
made to fit within the following simple framework. For the integrand f(x), find 
an approximating family $\{f_n(x) \mid n \ge 1\}$ and define

$$ I_n(f) = \int_a^b f_n(x)\,dx = I(f_n) \qquad (5.0.2) $$

We usually require the approximations $f_n(x)$ to satisfy

$$ \|f - f_n\|_\infty \to 0 \quad \text{as} \quad n \to \infty \qquad (5.0.3) $$


And the form of each $f_n(x)$ should be chosen such that $I(f_n)$ can be evaluated
easily. For the error,

$$ E_n(f) = I(f) - I_n(f) = \int_a^b [f(x) - f_n(x)]\,dx $$

$$ |E_n(f)| \le \int_a^b |f(x) - f_n(x)|\,dx \le (b - a)\|f - f_n\|_\infty \qquad (5.0.4) $$


Most numerical integration methods can be viewed within this framework, 
although some of them are better studied from some other perspective. The one
class of methods that does not fit within the framework consists of those based on
extrapolation using asymptotic estimates of the error. These are examined in
Section 5.4.
Most numerical integrals $I_n(f)$ will have the following form when they are
evaluated:

$$ I_n(f) = \sum_{j=1}^{n} w_{j,n} f(x_{j,n}), \qquad n \ge 1 \qquad (5.0.5) $$

The coefficients $w_{j,n}$ are called the integration weights or quadrature weights; and
the points $x_{j,n}$ are the integration nodes, usually chosen in [a, b]. The dependence
on n is usually suppressed, writing $w_j$ and $x_j$, although it will be
understood implicitly. Standard methods have nodes and weights that have
simple formulas or else they are tabulated in tables that are readily available.
Thus there is usually no need to explicitly construct the functions $f_n(x)$ of (5.0.2),
although their role in defining $I_n(f)$ may be useful to keep in mind.
The following example is a simple illustration of (5.0.2)-(5.0.4), but it is not of
the form (5.0.5).


Example Evaluate

$$ I = \int_0^1 \frac{e^x - 1}{x}\,dx \qquad (5.0.6) $$

This integrand has a removable singularity at the origin. Use a Taylor series for
$e^x$ [see (1.1.4) of Chapter 1] to define $f_n(x)$, and then define

$$ I_n = \int_0^1 \sum_{j=1}^{n} \frac{x^{j-1}}{j!}\,dx = \sum_{j=1}^{n} \frac{1}{(j!)(j)} \qquad (5.0.7) $$

For the error in $I_n$, use the Taylor formula (1.1.4) to obtain

$$ f(x) - f_n(x) = \frac{x^n}{(n+1)!} e^{\xi_x} $$

for some $0 \le \xi_x \le x$. Then

$$ I - I_n = \int_0^1 \frac{x^n}{(n+1)!} e^{\xi_x}\,dx $$

$$ \frac{1}{(n+1)!\,(n+1)} \le I - I_n \le \frac{e}{(n+1)!\,(n+1)} \qquad (5.0.8) $$

The sequence in (5.0.7) is rapidly convergent, and (5.0.8) allows us to estimate the
error very accurately. For example, with n = 6

$$ I_6 = 1.31787037 $$

and from (5.0.8)

$$ 2.83 \times 10^{-5} < I - I_6 < 7.70 \times 10^{-5} $$

The true error is $3.18 \times 10^{-5}$.
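
The numbers in this example can be checked with a few lines of code. The following is a minimal sketch, assuming double precision arithmetic; the function names `taylor_integral` and `error_bounds` are illustrative, and the reference value is simply the same series carried to 25 terms, by which point it has converged to machine accuracy.

```python
import math

def taylor_integral(n):
    """I_n = sum_{j=1}^{n} 1 / (j! * j), the finite sum in (5.0.7)."""
    return sum(1.0 / (math.factorial(j) * j) for j in range(1, n + 1))

def error_bounds(n):
    """Lower and upper bounds on I - I_n taken from (5.0.8)."""
    lower = 1.0 / (math.factorial(n + 1) * (n + 1))
    return lower, math.e * lower

I6 = taylor_integral(6)
lower, upper = error_bounds(6)
I_ref = taylor_integral(25)        # assumed converged reference value
true_error = I_ref - I6

print(f"I_6 = {I6:.8f}")
print(f"{lower:.2e} < I - I_6 < {upper:.2e}")
print(f"true error = {true_error:.2e}")
```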
For integrals in which the integrand has some kind of bad behavior, for
example, an infinite value at some point, we often will consider the integrand in
the form

$$ I(f) = \int_a^b w(x) f(x)\,dx \qquad (5.0.9) $$


The bad behavior is assumed to be located in w(x), called the weight function, 
and the function f(x) will be assumed to be well-behaved. For example, consider 
evaluating 


$$ \int_0^1 (\ln x)\, f(x)\,dx $$


for arbitrary continuous functions f(x). The framework (5.0.2)-(5.0.4) gener- 
alizes easily to the treatment of (5.0.9). Methods for such integrals are considered 
in Sections 5.3 and 5.6. 

Most numerical integration formulas are based on defining f,(x) in (5.0.2) by 
using polynomial or piecewise polynomial interpolation. Formulas using such 
interpolation with evenly spaced node points are derived and discussed in 
Sections 5.1 and 5.2. The Gaussian quadrature formulas, which are optimal in a 
certain sense and which have very rapid convergence, are given in Section 5.3. 
They are based on defining $f_n(x)$ using polynomial interpolation at carefully
selected node points that need not be evenly spaced. 

Asymptotic error formulas for the methods of Sections 5.1 and 5.2 are given 
and discussed in Section 5.4, and some new formulas are derived based on 
extrapolation with these error formulas. Some methods that control the integration
error in an automatic way, while remaining efficient, are given in Section 5.5.
Section 5.6 surveys methods for integrals that are singular or ill-behaved in some 
sense, and Section 5.7 discusses the difficult task of numerical differentiation. 


5.1 The Trapezoidal Rule and Simpson’s Rule 


We begin our development of numerical integration by giving two well-known 
numerical methods for evaluating 


$$ I(f) = \int_a^b f(x)\,dx \qquad (5.1.1) $$


We analyze and illustrate these methods very completely, and they serve as an 
introduction to the material of later sections. The interval [a, b] is always finite in 
this section. 
Figure 5.1 Illustration of trapezoidal rule.


The trapezoidal rule The simple trapezoidal rule is based on approximating
f(x) by the straight line joining (a, f(a)) and (b, f(b)). By integrating the
formula for this straight line, we obtain the approximation

$$ I_1(f) = \left(\frac{b - a}{2}\right)[f(a) + f(b)] \qquad (5.1.2) $$


This is of course the area of the trapezoid shown in Figure 5.1. To obtain an error 
formula, we use the interpolation error formula (3.1.10): 


$$ f(x) - \frac{(b - x)f(a) + (x - a)f(b)}{b - a} = (x - a)(x - b)\,f[a, b, x] $$


We also assume for all work with the error for the trapezoidal rule in this section 
that f(x) is twice continuously differentiable on [a, b]. Then 


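$$ E_1(f) = \int_a^b f(x)\,dx - \left(\frac{b - a}{2}\right)[f(a) + f(b)] = \int_a^b (x - a)(x - b)\,f[a, b, x]\,dx \qquad (5.1.3) $$

Using the integral mean value theorem [Theorem 1.3 of Chapter 1],

$$ E_1(f) = f[a, b, \xi] \int_a^b (x - a)(x - b)\,dx \qquad \text{some } a \le \xi \le b $$

$$ = \left[\tfrac{1}{2} f''(\eta)\right]\left[-\tfrac{1}{6}(b - a)^3\right] \qquad \text{some } \eta \in [a, b] $$

using (3.2.12). Thus

$$ E_1(f) = -\frac{(b - a)^3}{12} f''(\eta) \qquad \eta \in [a, b] \qquad (5.1.4) $$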


If b — a is not sufficiently small, the trapezoidal rule (5.1.2) is not of much 
use. For such an integral, we break it into a sum of integrals over small 
subintervals, and then we apply (5.1.2) to each of these smaller integrals. Let 
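$n \ge 1$, $h = (b - a)/n$, and $x_j = a + jh$ for $j = 0, 1, \dots, n$. Then

$$ I(f) = \int_a^b f(x)\,dx = \sum_{j=1}^{n} \int_{x_{j-1}}^{x_j} f(x)\,dx = \sum_{j=1}^{n} \left\{ \frac{h}{2}[f(x_{j-1}) + f(x_j)] - \frac{h^3}{12} f''(\eta_j) \right\} $$

with $x_{j-1} \le \eta_j \le x_j$. There is no reason why the subintervals $[x_{j-1}, x_j]$ must all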
have equal length, but it is customary to first introduce the general principles 
involved in this way. Although this is also the customary way in which the 
method is applied, there are situations in which it is desirable to vary the spacing 
of the nodes. 

The first terms in the sum can be combined to give the composite trapezoidal 
rule, 


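$$ I_n(f) = h\left[\tfrac{1}{2} f_0 + f_1 + f_2 + \cdots + f_{n-1} + \tfrac{1}{2} f_n\right] \qquad n \ge 1 \qquad (5.1.5) $$

with $f(x_j) \equiv f_j$. For the error in $I_n(f)$,

$$ E_n(f) = I(f) - I_n(f) = \sum_{j=1}^{n} -\frac{h^3}{12} f''(\eta_j) = -\frac{h^3 n}{12}\left[\frac{1}{n} \sum_{j=1}^{n} f''(\eta_j)\right] \qquad (5.1.6) $$

For the term in brackets,

$$ \min_{a \le x \le b} f''(x) \le M \equiv \frac{1}{n} \sum_{j=1}^{n} f''(\eta_j) \le \max_{a \le x \le b} f''(x) $$

Since $f''(x)$ is continuous for $a \le x \le b$, it must attain all values between its
minimum and maximum at some point of [a, b]; thus $f''(\eta) = M$ for some
$\eta \in [a, b]$. Thus we can write

$$ E_n(f) = -\frac{(b - a)h^2}{12} f''(\eta) \qquad \text{some } \eta \in [a, b] \qquad (5.1.7) $$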
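
A short sketch of the composite rule (5.1.5), together with a check of the error formula (5.1.7), may make the derivation concrete. The test integrand $e^x$ on [0, 1] and the choice n = 10 are assumptions made for illustration only; since $f''(x) = e^x$ lies between 1 and e on this interval, (5.1.7) predicts that the error lies between $-e h^2/12$ and $-h^2/12$.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule (5.1.5) with n equal subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for j in range(1, n):
        total += f(a + j * h)
    return h * total

a, b, n = 0.0, 1.0, 10
h = (b - a) / n
exact = math.e - 1.0                       # integral of e^x over [0, 1]
error = exact - trapezoid(math.exp, a, b, n)

# Bounds implied by (5.1.7), using 1 <= f''(x) <= e on [0, 1].
lower = -(b - a) * h**2 / 12.0 * math.e
upper = -(b - a) * h**2 / 12.0 * 1.0

print(f"error = {error:.6e}, predicted range [{lower:.6e}, {upper:.6e}]")
```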
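Another error estimate can be derived using this analysis. From (5.1.6)

$$ \lim_{n \to \infty} \frac{E_n(f)}{h^2} = \lim_{n \to \infty} \left[-\frac{1}{12} \sum_{j=1}^{n} f''(\eta_j)\,h\right] = -\frac{1}{12} \lim_{n \to \infty} \sum_{j=1}^{n} f''(\eta_j)\,h $$

Since $x_{j-1} \le \eta_j \le x_j$, $j = 1, \dots, n$, the last sum is a Riemann sum; thus

$$ \lim_{n \to \infty} \frac{E_n(f)}{h^2} = -\frac{1}{12} \int_a^b f''(x)\,dx = -\frac{1}{12}[f'(b) - f'(a)] \qquad (5.1.8) $$

$$ E_n(f) \approx -\frac{h^2}{12}[f'(b) - f'(a)] \equiv \tilde{E}_n(f) \qquad (5.1.9) $$

The term $\tilde{E}_n(f)$ is called an asymptotic error estimate for $E_n(f)$, and is valid in
the sense of (5.1.8).


Definition Let $E_n(f)$ be an exact error formula, and let $\tilde{E}_n(f)$ be an estimate
of it. We say that $\tilde{E}_n(f)$ is an asymptotic error estimate for $E_n(f)$ if

$$ \lim_{n \to \infty} \frac{\tilde{E}_n(f)}{E_n(f)} = 1 \qquad (5.1.10) $$

or equivalently,

$$ \lim_{n \to \infty} \frac{E_n(f) - \tilde{E}_n(f)}{E_n(f)} = 0 $$

The estimate in (5.1.9) meets this criterion, based on (5.1.8).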


The composite trapezoidal rule (5.1.5) could also have been obtained by
replacing f(x) by a piecewise linear interpolating function $f_n(x)$ interpolating
f(x) at the nodes $x_0, x_1, \dots, x_n$. From here on, we generally refer to the
composite trapezoidal rule as simply the trapezoidal rule.


Example We use the trapezoidal rule (5.1.5) to calculate

$$ I = \int_0^{\pi} e^x \cos(x)\,dx \qquad (5.1.11) $$

The true value is $I = -(e^{\pi} + 1)/2 \doteq -12.0703463164$. The values of $I_n$ are
given in Table 5.1, along with the true errors $E_n$ and the asymptotic estimates $\tilde{E}_n$,
obtained from (5.1.9). Note that the errors decrease by a factor of 4 when n is
doubled (and hence h is halved). This result was predictable from the multiplying
factor of $h^2$ present in (5.1.7) and (5.1.9); when h is halved, $h^2$ decreases by a
factor of 4. This example also shows that the trapezoidal rule is relatively
inefficient when compared with other methods to be developed in this chapter.


Table 5.1 Trapezoidal rule for evaluating (5.1.11)

  n     I_n            E_n         Ratio    E~_n
  2     -17.389259     5.32                 4.96
  4     -13.336023     1.27        4.20     1.24
  8     -12.382162     3.12E-1     4.06     3.10E-1
  16    -12.148004     7.77E-2     4.02     7.76E-2
  32    -12.089742     1.94E-2     4.00     1.94E-2
  64    -12.075194     4.85E-3     4.00     4.85E-3
  128   -12.071558     1.21E-3     4.00     1.21E-3
  256   -12.070649     3.03E-4     4.00     3.03E-4
  512   -12.070422     7.57E-5     4.00     7.57E-5
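
The entries of Table 5.1 can be regenerated with a short program. The sketch below is one possible way to do so; the names `trapezoid`, `asymptotic_estimate`, and `rows` are illustrative, and the derivative $f'(x) = e^x(\cos x - \sin x)$ is supplied by hand. Each ratio is the error for n/2 divided by the error for n.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule (5.1.5)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for j in range(1, n):
        total += f(a + j * h)
    return h * total

def asymptotic_estimate(df, a, b, n):
    """Asymptotic error estimate (5.1.9): -(h^2/12) [f'(b) - f'(a)]."""
    h = (b - a) / n
    return -h * h / 12.0 * (df(b) - df(a))

f = lambda x: math.exp(x) * math.cos(x)
df = lambda x: math.exp(x) * (math.cos(x) - math.sin(x))
exact = -(math.exp(math.pi) + 1.0) / 2.0

rows = []
previous_error = None
for n in (2, 4, 8, 16, 32, 64):
    approx = trapezoid(f, 0.0, math.pi, n)
    error = exact - approx
    ratio = previous_error / error if previous_error is not None else None
    rows.append((n, approx, error, ratio, asymptotic_estimate(df, 0.0, math.pi, n)))
    previous_error = error

for n, approx, error, ratio, estimate in rows:
    ratio_text = f"{ratio:6.2f}" if ratio is not None else "      "
    print(f"{n:4d}  {approx:12.6f}  {error:9.2e}  {ratio_text}  {estimate:9.2e}")
```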


Using the error estimate $\tilde{E}_n(f)$, we can define an improved numerical integration
rule:

$$ CT_n(f) = I_n(f) + \tilde{E}_n(f) = h\left[\tfrac{1}{2} f_0 + f_1 + \cdots + f_{n-1} + \tfrac{1}{2} f_n\right] - \frac{h^2}{12}[f'(b) - f'(a)] \qquad (5.1.12) $$


This is called the corrected trapezoidal rule. The accuracy of E,( f) should make 
CT,(f) much more accurate than the trapezoidal rule. Another derivation of 
(5.1.12) is suggested in Problem 4, one showing that (5.1.12) will fit into the 
approximation theoretic framework (5.0.2)-(5.0.4). The major difficulty of using 
CT,(f) is that f’(a) and f’(b) are required. 


Example Apply $CT_n(f)$ to the earlier example (5.1.11). The results are shown in
Table 5.2, together with the errors for the trapezoidal rule, for comparison.
Empirically, the error in $CT_n(f)$ is proportional to $h^4$, whereas it was proportional
to $h^2$ with the trapezoidal rule. A proof of this is suggested in Problem 4.
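
A minimal sketch of the corrected trapezoidal rule (5.1.12) applied to (5.1.11) follows. The function names are illustrative and the derivative is again supplied by hand; the printed ratios of successive errors should approach 16, consistent with an error proportional to $h^4$.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule (5.1.5)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for j in range(1, n):
        total += f(a + j * h)
    return h * total

def corrected_trapezoid(f, df, a, b, n):
    """Corrected trapezoidal rule (5.1.12); needs f'(a) and f'(b)."""
    h = (b - a) / n
    return trapezoid(f, a, b, n) - h * h / 12.0 * (df(b) - df(a))

f = lambda x: math.exp(x) * math.cos(x)
df = lambda x: math.exp(x) * (math.cos(x) - math.sin(x))
exact = -(math.exp(math.pi) + 1.0) / 2.0

ns = (2, 4, 8, 16, 32)
values = [corrected_trapezoid(f, df, 0.0, math.pi, n) for n in ns]
errors = [exact - v for v in values]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]

for n, v, e in zip(ns, values, errors):
    print(f"{n:4d}  {v:16.9f}  {e:9.2e}")
print("ratios:", [round(r, 1) for r in ratios])
```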


Table 5.2 The corrected trapezoidal rule for (5.1.11)

  n     CT_n(f)           Error       Ratio    Trap Error
  2     -12.425528367     3.55E-1              5.32
  4     -12.095090106     2.47E-2     14.4     1.27
  8     -12.071929245     1.58E-3     15.6     3.12E-1
  16    -12.070445804     9.95E-5     15.9     7.77E-2
  32    -12.070352543     6.23E-6     16.0     1.94E-2
  64    -12.070346706     3.89E-7     16.0     4.85E-3
  128   -12.070346341     2.43E-8     16.0     1.21E-3
Figure 5.2 Illustration of Simpson’s rule.


Simpson’s rule To improve upon the simple trapezoidal rule (5.1.2), we use a
quadratic interpolating polynomial $p_2(x)$ to approximate f(x) on [a, b]. Let
$c = (a + b)/2$, and define

$$ I_2(f) = \int_a^b \left[ \frac{(x - c)(x - b)}{(a - c)(a - b)} f(a) + \frac{(x - a)(x - b)}{(c - a)(c - b)} f(c) + \frac{(x - a)(x - c)}{(b - a)(b - c)} f(b) \right] dx $$

Carrying out the integration, we obtain

$$ I_2(f) = \frac{h}{3}\left[f(a) + 4 f\!\left(\frac{a + b}{2}\right) + f(b)\right] \qquad h = \frac{b - a}{2} \qquad (5.1.13) $$


This is called Simpson’s rule. An illustration is given in Figure 5.2, with the 
shaded region denoting the area under the graph of y = p,(x). 
For the error, we begin with the interpolation error formula (3.2.11) to obtain 


$$ E_2(f) = I(f) - I_2(f) = \int_a^b (x - a)(x - c)(x - b)\,f[a, b, c, x]\,dx \qquad (5.1.14) $$


We cannot apply the integral mean value theorem since the polynomial in the 
integrand changes sign at x = c = (a + b)/2. We will assume f(x) is four times 
continuously differentiable on [a, b] for the work of this section on Simpson’s 
rule. Define

$$ w(x) = \int_a^x (t - a)(t - c)(t - b)\,dt $$
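It is not hard to show that

$$ w(a) = w(b) = 0 \qquad w(x) > 0 \quad \text{for} \quad a < x < b $$

Integrating by parts,

$$ E_2(f) = \int_a^b w'(x)\,f[a, b, c, x]\,dx = \Big[w(x)\,f[a, b, c, x]\Big]_{x=a}^{x=b} - \int_a^b w(x)\,\frac{d}{dx} f[a, b, c, x]\,dx $$

$$ = -\int_a^b w(x)\,f[a, b, c, x, x]\,dx $$

The last equality used (3.2.17). Applying the integral mean value theorem and
(3.2.12),

$$ E_2(f) = -f[a, b, c, \xi, \xi] \int_a^b w(x)\,dx \qquad \text{some } a \le \xi \le b $$

$$ = -\frac{f^{(4)}(\eta)}{24}\left[\frac{4}{15} h^5\right] \qquad h = \frac{b - a}{2}, \quad \text{some } \eta \in [a, b] $$

Thus

$$ E_2(f) = -\frac{h^5}{90} f^{(4)}(\eta) \qquad \eta \in [a, b] \qquad (5.1.15) $$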


From this we see that E,(f) = 0 if f(x) is a polynomial of degree < 3, even 
though quadratic interpolation is exact only if f(x) is a polynomial of degree at 
most two. The additional fortuitous cancellation of errors is suggested in Figure 
5.2. This results in Simpson’s rule being much more accurate than the trapezoidal 
rule. 

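Again we create a composite rule. For $n \ge 2$ and even, define $h = (b - a)/n$,
$x_j = a + jh$ for $j = 0, 1, \dots, n$. Then

$$ I(f) = \int_a^b f(x)\,dx = \sum_{j=1}^{n/2} \int_{x_{2j-2}}^{x_{2j}} f(x)\,dx = \sum_{j=1}^{n/2} \left\{ \frac{h}{3}[f_{2j-2} + 4 f_{2j-1} + f_{2j}] - \frac{h^5}{90} f^{(4)}(\eta_j) \right\} $$

with $x_{2j-2} \le \eta_j \le x_{2j}$. Simplifying the first terms in the sum, we obtain the
composite Simpson rule:

$$ I_n(f) = \frac{h}{3}[f_0 + 4f_1 + 2f_2 + 4f_3 + 2f_4 + \cdots + 2f_{n-2} + 4f_{n-1} + f_n] \qquad (5.1.16) $$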
As before, we will simply call this Simpson’s rule. It is probably the most
widely used numerical integration rule. It is simple, easy to use, and reasonably
accurate for a wide variety of integrals.

For the error, as with the trapezoidal rule,

$$ E_n(f) = I(f) - I_n(f) = -\frac{h^5 (n/2)}{90}\left[\frac{2}{n} \sum_{j=1}^{n/2} f^{(4)}(\eta_j)\right] $$

$$ E_n(f) = -\frac{h^4 (b - a)}{180} f^{(4)}(\eta) \qquad \text{some } \eta \in [a, b] \qquad (5.1.17) $$

We can also derive the asymptotic error formula

$$ E_n(f) \approx -\frac{h^4}{180}[f^{(3)}(b) - f^{(3)}(a)] \equiv \tilde{E}_n(f) \qquad (5.1.18) $$

The proof is essentially the same as was used to obtain (5.1.9).


Example We use Simpson’s rule (5.1.16) to evaluate the integral (5.1.11),

$$ I = \int_0^{\pi} e^x \cos(x)\,dx $$

used earlier as an example for the trapezoidal rule. The numerical results are 
given in Table 5.3. Again, the rate of decrease in the error confirms the results 
given by (5.1.17) and (5.1.18). Comparing with the earlier results in Table 5.1 for 
the trapezoidal rule, it is clear that Simpson’s rule is superior. Comparing with 
Table 5.2, Simpson’s rule is slightly inferior, but the speed of convergence is the 
same. Simpson’s rule has the advantage of not requiring derivative values. 
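
The composite Simpson rule (5.1.16) is short to program. The following sketch, with illustrative function and variable names, applies it to (5.1.11) and prints the ratios of successive errors, which should approach 16 in line with (5.1.17).

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson rule (5.1.16); n must be even and at least 2."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be an even integer >= 2")
    h = (b - a) / n
    total = f(a) + f(b)
    for j in range(1, n):
        # Odd-indexed nodes carry weight 4, even-indexed interior nodes weight 2.
        total += (4.0 if j % 2 == 1 else 2.0) * f(a + j * h)
    return h * total / 3.0

f = lambda x: math.exp(x) * math.cos(x)
exact = -(math.exp(math.pi) + 1.0) / 2.0

ns = (2, 4, 8, 16, 32, 64, 128)
values = [simpson(f, 0.0, math.pi, n) for n in ns]
errors = [exact - v for v in values]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]

for n, v, e in zip(ns, values, errors):
    print(f"{n:4d}  {v:16.10f}  {e:10.2e}")
print("ratios:", [round(r, 1) for r in ratios])
```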


Table 5.3 Simpson’s rule for evaluating (5.1.11)

  n     I_n                E_n          Ratio    E~_n
  2     -11.5928395534     -4.78E-1              -1.63
  4     -11.9849440198     -8.54E-2     5.59     -1.02E-1
  8     -12.0642089572     -6.14E-3     13.9     -6.38E-3
  16    -12.0699513233     -3.95E-4     15.5     -3.99E-4
  32    -12.0703214561     -2.49E-5     15.9     -2.49E-5
  64    -12.0703447599     -1.56E-6     16.0     -1.56E-6
  128   -12.0703462191     -9.73E-8     16.0     -9.73E-8


Peano kernel error formulas There is another approach to deriving the error
formulas, and it does not result in the derivative being evaluated at an unknown
point $\eta$. We first consider the trapezoidal rule. Assume $f' \in C[a, b]$ and that
$f''(x)$ is integrable on [a, b]. Then using Taylor’s theorem [Theorem 1.4 in
Chapter 1],

$$ f(x) = p_1(x) + R_2(x) \qquad p_1(x) = f(a) + (x - a) f'(a) $$

$$ R_2(x) = \int_a^x (x - t) f''(t)\,dt $$


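Note that from (5.1.3),

$$ E_1(F + G) = E_1(F) + E_1(G) \qquad (5.1.19) $$

for any two functions $F, G \in C[a, b]$. Thus

$$ E_1(f) = E_1(p_1) + E_1(R_2) = E_1(R_2) $$

since $E_1(p_1) = 0$ from (5.1.4). Substituting,

$$ E_1(R_2) = \int_a^b R_2(x)\,dx - \left(\frac{b - a}{2}\right)[R_2(a) + R_2(b)] $$

$$ = \int_a^b \int_a^x (x - t) f''(t)\,dt\,dx - \left(\frac{b - a}{2}\right) \int_a^b (b - t) f''(t)\,dt $$

In general for any integrable function G(x, t),

$$ \int_a^b \int_a^x G(x, t)\,dt\,dx = \int_a^b \int_t^b G(x, t)\,dx\,dt \qquad (5.1.20) $$

Thus

$$ E_1(R_2) = \int_a^b f''(t) \int_t^b (x - t)\,dx\,dt - \left(\frac{b - a}{2}\right) \int_a^b (b - t) f''(t)\,dt $$

Combining integrals and simplifying the results,

$$ E_1(f) = \frac{1}{2} \int_a^b f''(t)(t - a)(t - b)\,dt \qquad (5.1.21) $$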


For the composite trapezoidal rule (5.1.5),

$$ E_n(f) = \int_a^b K(t) f''(t)\,dt \qquad (5.1.22) $$

$$ K(t) = \tfrac{1}{2}(t - t_{j-1})(t - t_j) \qquad t_{j-1} \le t \le t_j, \quad j = 1, 2, \dots, n \qquad (5.1.23) $$

The formulas (5.1.21) and (5.1.22) are called the Peano kernel formulation of the
error, and K(t) is called the Peano kernel. For a more general presentation, see
Davis (1963, chap. 3).
As a simple illustration of its use, take bounds in (5.1.22) to obtain

$$ |E_n(f)| \le \|K\|_\infty \int_a^b |f''(t)|\,dt = \frac{h^2}{8} \int_a^b |f''(t)|\,dt \qquad (5.1.24) $$

If $f''(t)$ is very peaked, this may give a better bound on the error than (5.1.7),
because in (5.1.7) we generally must replace $|f''(\eta)|$ by $\|f''\|_\infty$.
For Simpson’s rule, use Taylor’s theorem to write

$$ f(x) = p_3(x) + R_4(x) $$

$$ R_4(x) = \frac{1}{6} \int_a^x (x - t)^3 f^{(4)}(t)\,dt $$

As before

$$ E_2(f) = E_2(p_3) + E_2(R_4) = E_2(R_4) $$

and we then calculate $E_2(R_4)$ by direct substitution and simplification:

$$ E_2(f) = \int_a^b R_4(x)\,dx - \frac{h}{3}\left[R_4(a) + 4 R_4\!\left(\frac{a + b}{2}\right) + R_4(b)\right] $$

This yields

$$ E_2(f) = \int_a^b K(t) f^{(4)}(t)\,dt \qquad (5.1.25) $$

$$ K(t) = \begin{cases} \frac{1}{72}(t - a)^3 (3t - a - 2b) & a \le t \le \frac{a + b}{2} \\ \frac{1}{72}(b - t)^3 (b + 2a - 3t) & \frac{a + b}{2} \le t \le b \end{cases} \qquad (5.1.26) $$

A graph of K(t) is given in Figure 5.3. By direct evaluation,

$$ \|K\|_\infty = \frac{h^4}{72} \qquad \int_a^b K(t)\,dt = -\frac{h^5}{90} \qquad h = \frac{b - a}{2} $$

As with the composite trapezoidal method, these results extend easily to the
composite Simpson rule.
The following examples are intended to describe more fully the behavior of
Simpson’s and trapezoidal rules.


Figure 5.3 The Peano kernel for Simpson’s rule.

Example 1.

$$ f(x) = x^3 \sqrt{x} \qquad [a, b] = [0, 1] \qquad I = \tfrac{2}{9} $$

Table 5.4 gives the error for increasing values of n. The derivative $f^{(4)}(x)$ is
singular at x = 0, and thus the formula (5.1.17) cannot be applied to this case. As
an alternative, we use the generalization of (5.1.25) to the composite Simpson rule
to obtain

$$ |E_n(f)| \le \|K\|_\infty \int_0^1 |f^{(4)}(t)|\,dt = \left(\frac{h^4}{72}\right)\left(\frac{105}{8}\right) = \frac{35}{192} h^4 $$

Thus the error should decrease by a factor of 16 when h is halved (i.e., n is
doubled). This also gives a fairly realistic bound on the error. Note the close
agreement of the empirical values of ratio with the theoretically predicted values
of 4 and 16, respectively.


2.

$$ f(x) = \frac{1}{1 + (x - \pi)^2} \qquad [a, b] = [0, 5] \qquad I \doteq 2.33976628367 $$

Table 5.4 Trapezoidal, Simpson integration: case (1)

        Trapezoidal Rule         Simpson’s Rule
  n     Error         Ratio      Error          Ratio
  2     -7.197E-2                -3.370E-3
  4     -1.817E-2     3.96       -2.315E-4      14.6
  8     -4.553E-3     3.99       -1.543E-5      15.0
  16    -1.139E-3     4.00       -1.008E-6      15.3
  32    -2.848E-4     4.00       -6.489E-8      15.5
  64    -7.121E-5     4.00       -4.141E-9      15.7
  128   -1.780E-5     4.00       -2.626E-10     15.8
Table 5.5 Trapezoidal, Simpson integration: case (2)

        Trapezoidal Rule         Simpson’s Rule
  n     Error         Ratio      Error          Ratio
  2     1.731E-1                 -2.853E-1
  4     7.110E-2      2.43       3.709E-2       -7.69
  8     7.496E-3      9.49       -1.371E-2      -2.71
  16    1.953E-3      3.84       1.059E-4       -129
  32    4.892E-4      3.99       1.080E-6       98.1
  64    1.223E-4      4.00       6.743E-8       16.0
  128   3.059E-5      4.00       4.217E-9       16.0

According to theory, the infinite differentiability of f(x) implies a value for ratio
of 4.0 and 16.0 for the trapezoidal and Simpson rules, respectively. But these
need not hold for the first several values of n, as Table 5.5 shows. The integrand
is relatively peaked, especially its higher derivatives, and this affects the speed of
convergence.


3.

$$ f(x) = \sqrt{x} \qquad [a, b] = [0, 1] \qquad I = \tfrac{2}{3} $$

Since $f'(x)$ has an infinite value at x = 0, none of the theoretical results given
previously apply to this case. The numerical results are in Table 5.6; note that
there is still a regular behavior to the error. In fact, the errors of the two methods
decrease by the same ratio as n is doubled. This ratio of $2^{1.5} \doteq 2.83$ is explained
in Section 5.4, formula (5.4.24).


4.

$$ f(x) = e^{\cos(x)} \qquad [a, b] = [0, 2\pi] \qquad I \doteq 7.95492652101284 $$


The results are shown in Table 5.7, and they are extremely good. Both methods 


Table 5.6 Trapezoidal, Simpson integration: case (3)

        Trapezoidal Rule         Simpson’s Rule
  n     Error         Ratio      Error          Ratio
  2     6.311E-2                 2.860E-2
  4     2.338E-2      2.70       1.012E-2       2.83
  8     8.536E-3      2.74       3.587E-3       2.82
  16    3.085E-3      2.77       1.268E-3       2.83
  32    1.108E-3      2.78       4.485E-4       2.83
  64    3.959E-4      2.80       1.586E-4       2.83
  128   1.410E-4      2.81       5.606E-5       2.83
Table 5.7 Trapezoidal, Simpson integration: case (4)

        Trapezoidal Rule           Simpson’s Rule
  n     Error          Ratio       Error          Ratio
  2     -1.74                      7.21E-1
  4     -3.44E-2       5.06E+1     5.34E-1        1.35
  8     -1.25E-6       2.75E+4     1.15E-2        4.64E+1
  16    < 1.00E-14     > 1.25E+8   4.17E-7        2.76E+4
  32    < 1.00E-14                 < 1.00E-14     > 4.17E+7

are very rapidly convergent, with the trapezoidal rule superior to Simpson’s rule. 
This illustrates the excellent convergence of the trapezoidal rule for periodic 
integrands; this is analyzed in Section 5.4. An indication of this behavior can be 
seen from the asymptotic error terms (5.1.9) and (5.1.18), since both estimates are 
zero in this case of f(x). 
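
The contrasting behavior in cases (3) and (4) is easy to observe numerically. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the reference value for case (4) is the 15-digit value of I quoted above. It prints the error ratios for $\sqrt{x}$ on [0, 1], which should be near $2^{1.5}$, and the errors for the periodic integrand $e^{\cos(x)}$ on $[0, 2\pi]$.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule (5.1.5)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for j in range(1, n):
        total += f(a + j * h)
    return h * total

def simpson(f, a, b, n):
    """Composite Simpson rule (5.1.16); n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for j in range(1, n):
        total += (4.0 if j % 2 == 1 else 2.0) * f(a + j * h)
    return h * total / 3.0

# Case (3): f(x) = sqrt(x) on [0, 1], I = 2/3.
ns = (2, 4, 8, 16, 32, 64, 128)
trap3 = [2.0 / 3.0 - trapezoid(math.sqrt, 0.0, 1.0, n) for n in ns]
simp3 = [2.0 / 3.0 - simpson(math.sqrt, 0.0, 1.0, n) for n in ns]
trap3_ratio = trap3[-2] / trap3[-1]
simp3_ratio = simp3[-2] / simp3[-1]
print("case 3 ratios:", round(trap3_ratio, 2), round(simp3_ratio, 2))

# Case (4): f(x) = exp(cos x) on [0, 2*pi], using the quoted value of I.
g = lambda x: math.exp(math.cos(x))
I4 = 7.95492652101284
trap4 = {n: I4 - trapezoid(g, 0.0, 2.0 * math.pi, n) for n in (2, 4, 8, 16)}
simp4 = {n: I4 - simpson(g, 0.0, 2.0 * math.pi, n) for n in (2, 4, 8, 16)}
for n in (2, 4, 8, 16):
    print(f"{n:3d}  trapezoid {trap4[n]:10.2e}   Simpson {simp4[n]:10.2e}")
```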


5.2 Newton—Cotes Integration Formulas 
The simple trapezoidal rule (5.1.2) and Simpson’s rule (5.1.13) are the first two
cases of the Newton-Cotes integration formula. For $n \ge 1$, let $h = (b - a)/n$,
$x_j = a + jh$ for $j = 0, 1, \dots, n$. Define $I_n(f)$ by replacing f(x) by its interpolating
polynomial $p_n(x)$ on the nodes $x_0, x_1, \dots, x_n$:

$$ I(f) = \int_a^b f(x)\,dx \approx I_n(f) = \int_a^b p_n(x)\,dx \qquad (5.2.1) $$

Using the Lagrange formula (3.1.6) for $p_n(x)$,

$$ I_n(f) = \int_a^b \sum_{j=0}^{n} l_{j,n}(x) f(x_j)\,dx = \sum_{j=0}^{n} w_{j,n} f(x_j) \qquad (5.2.2) $$

with

$$ w_{j,n} = \int_a^b l_{j,n}(x)\,dx \qquad j = 0, 1, \dots, n \qquad (5.2.3) $$

Usually we suppress the subscript n and write just $w_j$. We have already
calculated the cases n = 1 and 2. To illustrate the calculation of the weights, we
give the case of $w_0$ for n = 3.


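$$ w_0 = \int_a^b l_0(x)\,dx = \int_{x_0}^{x_3} \frac{(x - x_1)(x - x_2)(x - x_3)}{(x_0 - x_1)(x_0 - x_2)(x_0 - x_3)}\,dx $$

A change of variable simplifies the calculations. Let $x = x_0 + \mu h$, $0 \le \mu \le 3$.
Then

$$ w_0 = -\frac{1}{6h^3} \int_{x_0}^{x_3} (x - x_1)(x - x_2)(x - x_3)\,dx $$

$$ = -\frac{1}{6h^3} \int_0^3 (\mu - 1)h\,(\mu - 2)h\,(\mu - 3)h \cdot h\,d\mu $$

$$ = -\frac{h}{6} \int_0^3 (\mu - 1)(\mu - 2)(\mu - 3)\,d\mu $$

$$ w_0 = \frac{3h}{8} $$

The complete formula for n = 3 is

$$ I_3(f) = \frac{3h}{8}[f(x_0) + 3f(x_1) + 3f(x_2) + f(x_3)] \qquad (5.2.4) $$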


and is called the three-eighths rule. 
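
The same change of variable gives a simple way to compute all the weights exactly. The sketch below is one possible approach, not a method taken from the text: it builds each Lagrange basis polynomial on the scaled nodes $\mu = 0, 1, \dots, n$ with exact rational arithmetic and integrates it term by term, returning the ratios $w_j/h$. The function name `newton_cotes_weights` is illustrative.

```python
from fractions import Fraction

def newton_cotes_weights(n):
    """Return [w_0/h, ..., w_n/h] for the closed Newton-Cotes rule of index n."""
    weights = []
    for j in range(n + 1):
        # Lagrange basis polynomial l_j(mu) on nodes 0..n, stored as
        # coefficients of increasing powers of mu.
        coeffs = [Fraction(1)]
        for k in range(n + 1):
            if k == j:
                continue
            denom = Fraction(j - k)
            new = [Fraction(0)] * (len(coeffs) + 1)
            for i, c in enumerate(coeffs):
                new[i] += c * (-k) / denom      # constant part of (mu - k)/(j - k)
                new[i + 1] += c / denom         # linear part of (mu - k)/(j - k)
            coeffs = new
        # Integrate term by term over 0 <= mu <= n.
        weights.append(sum(c * Fraction(n) ** (i + 1) / (i + 1)
                           for i, c in enumerate(coeffs)))
    return weights

for n in (1, 2, 3, 4):
    print(n, [str(w) for w in newton_cotes_weights(n)])
```

For n = 3 this reproduces $w_0 = 3h/8$ and the remaining coefficients of (5.2.4).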
For the error, we give the following theorem. 


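Theorem 5.1 (a) For n even, assume f(x) is n + 2 times continuously differentiable
on [a, b]. Then

$$ I(f) - I_n(f) = C_n h^{n+3} f^{(n+2)}(\eta) \qquad \text{some } \eta \in [a, b] \qquad (5.2.5) $$

with

$$ C_n = \frac{1}{(n + 2)!} \int_0^n \mu^2 (\mu - 1) \cdots (\mu - n)\,d\mu \qquad (5.2.6) $$

(b) For n odd, assume f(x) is n + 1 times continuously differentiable
on [a, b]. Then

$$ I(f) - I_n(f) = C_n h^{n+2} f^{(n+1)}(\eta) \qquad \text{some } \eta \in [a, b] \qquad (5.2.7) $$

with

$$ C_n = \frac{1}{(n + 1)!} \int_0^n \mu (\mu - 1) \cdots (\mu - n)\,d\mu $$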

Proof We sketch the proof for part (a), the most important case. For complete
proofs of both cases, see Isaacson and Keller (1966, pp. 308-314). From
(3.2.11),

$$E_n(f) = I(f) - I_n(f) = \int_a^b (x - x_0)(x - x_1)\cdots(x - x_n)\, f[x_0, x_1, \ldots, x_n, x]\,dx$$

Define

$$w(x) = \int_a^x (t - x_0)\cdots(t - x_n)\,dt$$

Then

$$w(a) = w(b) = 0 \qquad w(x) > 0 \quad \text{for } a < x < b$$

The proof that $w(x) > 0$ can be found in Isaacson and Keller (1966,
p. 309). It is easy to show $w(b) = 0$, since the integrand
$(t - x_0)\cdots(t - x_n)$ is an odd function with respect to the middle
node $x_{n/2} = (a + b)/2$.
Using integration by parts and (3.2.17),

$$\begin{aligned}
E_n(f) &= \int_a^b w'(x)\, f[x_0, \ldots, x_n, x]\,dx \\
       &= \Big[w(x)\, f[x_0, \ldots, x_n, x]\Big]_a^b - \int_a^b w(x)\,\frac{d}{dx} f[x_0, \ldots, x_n, x]\,dx \\
E_n(f) &= -\int_a^b w(x)\, f[x_0, \ldots, x_n, x, x]\,dx
\end{aligned}$$

Using the integral mean value theorem and (3.2.12),

$$\begin{aligned}
E_n(f) &= -f[x_0, \ldots, x_n, \xi, \xi]\int_a^b w(x)\,dx \\
       &= -\frac{f^{(n+2)}(\eta)}{(n+2)!}\int_a^b\!\int_a^x (t - x_0)\cdots(t - x_n)\,dt\,dx
\end{aligned} \tag{5.2.8}$$

We change the order of integration and then use the change of variable
$t = x_0 + \mu h$, $0 \le \mu \le n$:

$$\begin{aligned}
\int_a^b w(x)\,dx &= \int_a^b\!\int_t^b (t - x_0)\cdots(t - x_n)\,dx\,dt \\
  &= \int_a^b (t - x_0)\cdots(t - x_n)(b - t)\,dt \\
  &= -h^{n+3}\int_0^n \mu(\mu - 1)\cdots(\mu - n + 1)(\mu - n)^2\,d\mu
\end{aligned}$$

Use the change of variable $\nu = n - \mu$ to give the result

$$\int_a^b w(x)\,dx = -h^{n+3}\int_0^n (n - \nu)\cdots(1 - \nu)\,\nu^2\,d\nu$$

Use the fact that $n$ is even and combine the preceding with (5.2.8), to
obtain the result (5.2.5)-(5.2.6).

Table 5.8 Commonly used Newton-Cotes formulas

n = 1 (trapezoidal rule):
$$\int_a^b f(x)\,dx = \frac{h}{2}\left[f(a) + f(b)\right] - \frac{h^3}{12} f''(\xi)$$

n = 2 (Simpson's rule):
$$\int_a^b f(x)\,dx = \frac{h}{3}\left[f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right] - \frac{h^5}{90} f^{(4)}(\xi)$$

n = 3:
$$\int_a^b f(x)\,dx = \frac{3h}{8}\left[f(a) + 3f(a + h) + 3f(b - h) + f(b)\right] - \frac{3h^5}{80} f^{(4)}(\xi)$$

n = 4:
$$\int_a^b f(x)\,dx = \frac{2h}{45}\left[7f(a) + 32f(a + h) + 12f\!\left(\frac{a+b}{2}\right) + 32f(b - h) + 7f(b)\right] - \frac{8h^7}{945} f^{(6)}(\xi)$$


For easy reference, the most commonly used Newton-Cotes formulas are
given in Table 5.8. For $n = 4$, $I_4(f)$ is often called Boole's rule. As previously,
let $h = (b - a)/n$ in the table.


Definition A numerical integration formula $\tilde{I}(f)$ that approximates $I(f)$ is
said to have degree of precision $m$ if

1. $\tilde{I}(f) = I(f)$ for all polynomials $f(x)$ of degree $\le m$.
2. $\tilde{I}(f) \ne I(f)$ for some polynomial $f$ of degree $m + 1$.


Example With $n = 1, 3$ in Table 5.8, the degrees of precision are also $m = n =
1, 3$, respectively. But with $n = 2, 4$, the degrees of precision are $m = n + 1 =
3, 5$, respectively. This illustrates the general result that Newton-Cotes formulas
with an even index $n$ gain an extra degree of precision as compared with those of
an odd index [see formulas (5.2.5) and (5.2.7)].


Each Newton-Cotes formula can be used to construct a composite rule. The
most useful remaining one is probably that based on Boole's rule (see Problem 7).
We omit any further details.


Convergence discussion The next question of interest is whether $I_n(f)$
converges to $I(f)$ as $n \to \infty$. Given the lack of convergence of the interpolation
polynomials on evenly spaced nodes for some choices of $f(x)$ [see (3.5.10)], we
should expect some difficulties. Table 5.9 gives the results for a well-known
example,

$$I = \int_{-4}^{4} \frac{dx}{1 + x^2} = 2\tan^{-1}(4) \approx 2.6516 \tag{5.2.9}$$

These Newton-Cotes numerical integrals are diverging; and this illustrates the
fact that the Newton-Cotes integration formulas $I_n(f)$ in (5.2.2) need not
converge to $I(f)$.

To understand the implications of the lack of convergence of Newton—Cotes 
quadrature for (5.2.9), we first give a general discussion of the convergence of 
numerical integration methods. 


Table 5.9 Newton-Cotes example (5.2.9)


Definition Let $\mathscr{F}$ be a family of continuous functions on a given interval $[a, b]$.
We say $\mathscr{F}$ is dense in $C[a, b]$ if for every $f \in C[a, b]$ and every
$\epsilon > 0$, there is a function $f_\epsilon$ in $\mathscr{F}$ for which

$$\max_{a \le x \le b} |f(x) - f_\epsilon(x)| \le \epsilon \tag{5.2.10}$$


Example 1. From the Weierstrass theorem [see Theorem 4.1], the set of all
polynomials is dense in $C[a, b]$.

2. Let $n \ge 1$, $h = (b - a)/n$, $x_j = a + jh$, $0 \le j \le n$. Let $f(x)$ be linear on
each of the subintervals $[x_{j-1}, x_j]$. Define $\mathscr{F}$ to be the set of all such piecewise
linear functions $f(x)$ for all $n$. We leave to Problem 11 the proof that $\mathscr{F}$ is dense
in $C[a, b]$.


Theorem 5.2 Let

$$I_n(f) = \sum_{j=0}^{n} w_{j,n} f(x_{j,n}) \qquad n \ge 1$$

be a sequence of numerical integration formulas that approximate

$$I(f) = \int_a^b f(x)\,dx$$

Let $\mathscr{F}$ be a family dense in $C[a, b]$. Then

$$I_n(f) \to I(f) \qquad \text{all } f \in C[a, b] \tag{5.2.11}$$

if and only if

$$\text{1.}\quad I_n(f) \to I(f) \qquad \text{all } f \in \mathscr{F} \tag{5.2.12}$$

and

$$\text{2.}\quad B \equiv \sup_{n \ge 1} \sum_{j=0}^{n} |w_{j,n}| < \infty \tag{5.2.13}$$

Proof (a) Trivially, (5.2.11) implies (5.2.12). But the proof that (5.2.11)
implies (5.2.13) is much more difficult. It is an example of the principle of
uniform boundedness, and it can be found in almost any text on
functional analysis; for example, see Cryer (1982, p. 121).

(b) We now prove that (5.2.12) and (5.2.13) imply (5.2.11). Let
$f \in C[a, b]$ be given, and let $\epsilon > 0$ be arbitrary. Using the assumption
that $\mathscr{F}$ is dense in $C[a, b]$, pick $f_\epsilon \in \mathscr{F}$ such that

$$\max_{a \le x \le b} |f(x) - f_\epsilon(x)| \le \frac{\epsilon}{2[b - a + B]} \tag{5.2.14}$$

Then write

$$I(f) - I_n(f) = [I(f) - I(f_\epsilon)] + [I(f_\epsilon) - I_n(f_\epsilon)] + [I_n(f_\epsilon) - I_n(f)]$$

It is straightforward to derive, using (5.2.13) and (5.2.14), that

$$\begin{aligned}
|I(f) - I_n(f)| &\le |I(f) - I(f_\epsilon)| + |I(f_\epsilon) - I_n(f_\epsilon)| + |I_n(f_\epsilon) - I_n(f)| \\
                &\le \frac{\epsilon}{2} + |I(f_\epsilon) - I_n(f_\epsilon)|
\end{aligned}$$

Using (5.2.12), $I_n(f_\epsilon) \to I(f_\epsilon)$ as $n \to \infty$. Thus for all sufficiently large
$n$, say $n \ge n_\epsilon$,

$$|I(f) - I_n(f)| \le \epsilon$$

Since $\epsilon$ was arbitrary, this shows $I_n(f) \to I(f)$ as $n \to \infty$. ∎


Since the Newton-Cotes numerical integrals $I_n(f)$ do not converge to $I(f)$
for $f(x) = 1/(1 + x^2)$ on $[-4, 4]$, it must follow that either condition (5.2.12) or
(5.2.13) is violated. If we choose $\mathscr{F}$ as the polynomials, then (5.2.12) is satisfied,
since $I_n(p) = I(p)$ for any polynomial $p$ of degree $\le n$. Thus (5.2.13) must be
false. For the Newton-Cotes formulas (5.2.2),

$$\sup_{n} \sum_{j=0}^{n} |w_{j,n}| = \infty \tag{5.2.15}$$

Since $I(f) = I_n(f)$ for the special case $f(x) \equiv 1$, for any $n$, we have

$$\sum_{j=0}^{n} w_{j,n} = b - a \qquad n \ge 1 \tag{5.2.16}$$

Combining these results, the weights $w_{j,n}$ must vary in sign as $n$ becomes
sufficiently large. For example, using $n = 8$,

$$\int_{x_0}^{x_8} f(x)\,dx \approx I_8(f) = \frac{4h}{14175}\Big[989(f_0 + f_8) + 5888(f_1 + f_7) - 928(f_2 + f_6) + 10496(f_3 + f_5) - 4540 f_4\Big]$$

Such formulas can cause loss-of-significance errors, although this is unlikely to be a
serious problem until $n$ is larger. But because of this problem, people have
generally avoided using Newton-Cotes formulas for $n \ge 8$, even in forming
composite formulas.

The most serious problem of the Newton—Cotes method (5.2.2) is that it may 
not converge for perfectly well-behaved integrands, as in (5.2.9). 
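Both difficulties can be observed numerically. The sketch below is illustrative only; the function names `newton_cotes_weights` and `newton_cotes` are assumptions, not from the text. It obtains the scaled weights $w_{j,n}/h$ exactly by imposing exactness on $1, \mu, \ldots, \mu^n$ over the integer nodes $\mu = 0, \ldots, n$ (the degree-of-precision conditions), then applies the rules to the integrand of (5.2.9) and reports $\sum_j |w_{j,n}| / (b - a)$ together with whether any weight is negative.

```python
from fractions import Fraction

def newton_cotes_weights(n):
    """Return w_j / h for the closed Newton-Cotes rule on nodes mu = 0..n.

    The weights solve the moment equations
        sum_j w_j * j**i = n**(i+1) / (i+1),   i = 0..n,
    which state that the rule is exact for 1, mu, ..., mu**n.
    """
    size = n + 1
    rows = [[Fraction(j) ** i for j in range(size)]
            + [Fraction(n) ** (i + 1) / (i + 1)] for i in range(size)]
    for col in range(size):                       # Gauss-Jordan elimination
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        diag = rows[col][col]
        rows[col] = [v / diag for v in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

def newton_cotes(f, a, b, n):
    """Apply the (n+1)-point closed Newton-Cotes rule to f on [a, b]."""
    h = (b - a) / n
    weights = newton_cotes_weights(n)
    return h * sum(float(w) * f(a + j * h) for j, w in enumerate(weights))

def f(x):
    return 1.0 / (1.0 + x * x)

for n in (2, 4, 6, 8, 10):
    weights = newton_cotes_weights(n)
    abs_sum = float(sum(abs(w) for w in weights)) / n   # sum |w_j| / (b - a)
    print(n, newton_cotes(f, -4.0, 4.0, n), abs_sum, min(weights) < 0)
```

For $n = 8$ the computed weights agree with the coefficients displayed above, including the negative entries; while all weights are positive the normalized sum of absolute values equals 1, and it exceeds 1 as soon as a negative weight appears.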


The midpoint rule There are additional Newton-Cotes formulas in which one
or both of the endpoints of integration are deleted from the interpolation (and
integration) node points. The best known of these is also the simplest, the
midpoint rule. It is based on interpolation of the integrand $f(x)$ by the constant
$f((a + b)/2)$; and the resulting integration formula is

$$\int_a^b f(x)\,dx = (b - a)\, f\!\left(\frac{a + b}{2}\right) + \frac{(b - a)^3}{24} f''(\eta) \qquad \text{some } \eta \in [a, b] \tag{5.2.17}$$

For its composite form, define

$$x_j = a + \left(j - \tfrac{1}{2}\right)h \qquad j = 1, 2, \ldots, n$$

the midpoints of the intervals $[a + (j - 1)h,\ a + jh]$. Then

$$\int_a^b f(x)\,dx = I_n(f) + E_n(f)$$

$$I_n(f) = h\left[f_1 + f_2 + \cdots + f_n\right] \tag{5.2.18}$$

$$E_n(f) = \frac{h^2 (b - a)}{24} f''(\eta) \qquad \text{some } \eta \in [a, b] \tag{5.2.19}$$

The proof of these results is left as Problem 10.
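As a small illustration of (5.2.18) and (5.2.19), the following sketch applies the composite midpoint rule to $e^x$ on $[0, 1]$. The function name `composite_midpoint` and the choice of test integrand are assumptions made for this example. Doubling $n$ should reduce the error by a factor close to 4, and the error should lie between $h^2/24$ and $e\,h^2/24$, since $1 \le f''(x) \le e$ on this interval.

```python
import math

def composite_midpoint(f, a, b, n):
    """Composite midpoint rule (5.2.18) with n subintervals of width h."""
    h = (b - a) / n
    return h * sum(f(a + (j - 0.5) * h) for j in range(1, n + 1))

exact = math.e - 1.0                      # integral of exp(x) over [0, 1]
sizes = (10, 20, 40)
errors = [exact - composite_midpoint(math.exp, 0.0, 1.0, n) for n in sizes]
for n, err in zip(sizes, errors):
    # second column: true error; third column: h**2 * (b - a) / 24
    print(n, err, (1.0 / n) ** 2 / 24.0)
```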

These integration formulas in which one or both endpoints are missing are 
called open Newton—Cotes formulas, and the previous formulas are called closed 
formulas. The open formulas of higher order were used classically in deriving 
numerical formulas for the solution of ordinary differential equations. 




5.3 Gaussian Quadrature 


The composite trapezoidal and Simpson rules are based on using a low-order 
polynomial approximation of the integrand f(x) on subintervals of decreasing 
size. In this section, we investigate a class of methods that use polynomial 
approximations of f(x) of increasing degree. The resulting integration formulas
are extremely accurate in most cases, and they should be considered seriously
by anyone faced with many integrals to evaluate.

For greater generality, we will consider formulas

$$I_n(f) = \sum_{j=1}^{n} w_{j,n} f(x_{j,n}) \approx \int_a^b w(x) f(x)\,dx = I(f) \tag{5.3.1}$$

The weight function $w(x)$ is assumed to be nonnegative and integrable on $[a, b]$,
and it is to also satisfy the hypotheses (4.3.8) and (4.3.9) of Section 4.3. The
nodes $\{x_{j,n}\}$ and weights $\{w_{j,n}\}$ are to be chosen so that $I_n(f)$ equals $I(f)$
exactly for polynomials $f(x)$ of as large a degree as possible. It is hoped that this
will result in a formula $I_n(f)$ that is nearly exact for integrands $f(x)$ that are
well approximable by polynomials. In Section 5.2, the Newton-Cotes formulas
have an increasing degree of precision as $n$ increases, but nonetheless they do not
converge for many well-behaved integrands. The difficulty with the Newton-Cotes
formulas is that the nodes $\{x_{j,n}\}$ must be evenly spaced. By omitting this
restriction, we will be able to obtain new formulas $I_n(f)$ that converge for all
$f \in C[a, b]$.

To obtain some intuition for the determination of $I_n(f)$, consider the special
case

$$\int_{-1}^{1} f(x)\,dx \approx \sum_{j=1}^{n} w_j f(x_j) \tag{5.3.2}$$

where $w(x) \equiv 1$ and the explicit dependence of $\{w_j\}$ and $\{x_j\}$ on $n$ has been
dropped. The weights $\{w_j\}$ and nodes $\{x_j\}$ are to be determined to make the
error

$$E_n(f) = \int_{-1}^{1} f(x)\,dx - \sum_{j=1}^{n} w_j f(x_j) \tag{5.3.3}$$

equal zero for as high a degree polynomial $f(x)$ as possible. To derive equations
for the nodes and weights, we first note that

$$E_n(a_0 + a_1 x + \cdots + a_m x^m) = a_0 E_n(1) + a_1 E_n(x) + \cdots + a_m E_n(x^m) \tag{5.3.4}$$

Thus $E_n(f) = 0$ for every polynomial of degree $\le m$ if and only if

$$E_n(x^i) = 0 \qquad i = 0, 1, \ldots, m \tag{5.3.5}$$

Case 1. $n = 1$. Since there are two parameters, $w_1$ and $x_1$, we consider
requiring

$$E_1(1) = 0 \qquad E_1(x) = 0$$

This gives

$$\int_{-1}^{1} 1\,dx - w_1 = 0 \qquad \int_{-1}^{1} x\,dx - w_1 x_1 = 0$$

This implies $w_1 = 2$ and $x_1 = 0$. Thus the formula (5.3.2) becomes

$$\int_{-1}^{1} f(x)\,dx \approx 2 f(0)$$

the midpoint rule.


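Case 2. $n = 2$. There are four parameters, $w_1, w_2, x_1, x_2$, and thus we put
four constraints on these parameters:

$$E_2(x^i) = \int_{-1}^{1} x^i\,dx - \left[w_1 x_1^i + w_2 x_2^i\right] = 0 \qquad i = 0, 1, 2, 3$$

or

$$\begin{aligned}
w_1 + w_2 &= 2 \\
w_1 x_1 + w_2 x_2 &= 0 \\
w_1 x_1^2 + w_2 x_2^2 &= \tfrac{2}{3} \\
w_1 x_1^3 + w_2 x_2^3 &= 0
\end{aligned}$$

These equations lead to the unique formula

$$\int_{-1}^{1} f(x)\,dx \approx f\!\left(-\frac{\sqrt{3}}{3}\right) + f\!\left(\frac{\sqrt{3}}{3}\right) \tag{5.3.6}$$

which has degree of precision three. Compare this with Simpson's rule (5.1.13),
which uses three nodes to attain the same degree of precision.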
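The degree of precision of (5.3.6) is easy to confirm numerically. The short sketch below (the names `gauss2` and `monomial_error` are illustrative assumptions) evaluates $E_2(x^i)$ for $i = 0, \ldots, 5$: the error vanishes, up to rounding, for $i \le 3$ and equals $2/5 - 2/9 = 8/45$ for $i = 4$.

```python
import math

def gauss2(f):
    """Two-point Gauss rule (5.3.6) on [-1, 1]."""
    node = math.sqrt(3.0) / 3.0
    return f(-node) + f(node)

def monomial_error(i):
    """E_2(x**i): exact integral of x**i over [-1, 1] minus the rule's value."""
    exact = 2.0 / (i + 1) if i % 2 == 0 else 0.0
    return exact - gauss2(lambda x: x ** i)

for i in range(6):
    print(i, monomial_error(i))
```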


Case 3. For a general $n$ there are $2n$ free parameters $\{x_j\}$ and $\{w_j\}$, and we
would guess that there is a formula (5.3.2) that uses $n$ nodes and gives a degree of
precision of $2n - 1$. The equations to be solved are

$$E_n(x^i) = 0 \qquad i = 0, 1, \ldots, 2n - 1$$

or

$$\sum_{j=1}^{n} w_j x_j^i = \begin{cases} 0 & i = 1, 3, \ldots, 2n - 1 \\[4pt] \dfrac{2}{i + 1} & i = 0, 2, \ldots, 2n - 2 \end{cases} \tag{5.3.7}$$

These are nonlinear equations, and their solvability is not at all obvious. Because
of the difficulty in working with this nonlinear system, we use another approach
to the theory for (5.3.2), one that is somewhat circuitous.

Let $\{\varphi_n(x) \mid n \ge 0\}$ be the orthogonal polynomials on $(a, b)$ with respect to
the weight function $w(x) \ge 0$. Denote the zeros of $\varphi_n(x)$ by

$$a < x_1 < \cdots < x_n < b \tag{5.3.8}$$

Also, recall the notation from (4.4.18)-(4.4.20):

$$\varphi_n(x) = A_n x^n + \cdots \qquad a_n = \frac{A_{n+1}}{A_n} \qquad \gamma_n = \int_a^b w(x)\,[\varphi_n(x)]^2\,dx \tag{5.3.9}$$


Theorem 5.3 For each $n \ge 1$, there is a unique numerical integration formula
(5.3.1) of degree of precision $2n - 1$. Assuming $f(x)$ is $2n$ times
continuously differentiable on $[a, b]$, the formula for $I_n(f)$ and its
error is given by

$$\int_a^b w(x) f(x)\,dx = \sum_{j=1}^{n} w_j f(x_j) + \frac{\gamma_n}{A_n^2\,(2n)!} f^{(2n)}(\eta) \tag{5.3.10}$$

for some $a < \eta < b$. The nodes $\{x_j\}$ are the zeros of $\varphi_n(x)$, and
the weights $\{w_j\}$ are given by

$$w_j = \frac{-a_n \gamma_n}{\varphi_n'(x_j)\,\varphi_{n+1}(x_j)} \qquad j = 1, \ldots, n \tag{5.3.11}$$

Proof The proof is divided into three parts. We first obtain a formula with
degree of precision $2n - 1$, using the nodes (5.3.8). We then show that it
is unique. Finally, we sketch the derivation of the error formula and the
weights.

(a) Construction of the formula. Hermite interpolation is used as the
vehicle for the construction (see Section 3.6 to review the notation and
results). For the nodes in (5.3.8), the Hermite polynomial interpolating
$f(x)$ and $f'(x)$ is

$$H_n(x) = \sum_{j=1}^{n} f(x_j)\,h_j(x) + \sum_{j=1}^{n} f'(x_j)\,\tilde{h}_j(x) \tag{5.3.12}$$

with $h_j(x)$ and $\tilde{h}_j(x)$ defined in (3.6.2) of Section 3.6. The interpolation
error is given by

$$\begin{aligned}
\mathscr{E}_n(x) = f(x) - H_n(x) &= [\psi_n(x)]^2\, f[x_1, x_1, \ldots, x_n, x_n, x] \\
  &= \frac{[\psi_n(x)]^2}{(2n)!} f^{(2n)}(\xi) \qquad \xi \in [a, b]
\end{aligned} \tag{5.3.13}$$

with

$$\psi_n(x) = (x - x_1)\cdots(x - x_n)$$

Note that

$$\psi_n(x) = \frac{\varphi_n(x)}{A_n} \tag{5.3.14}$$

since both $\varphi_n(x)$ and $\psi_n(x)$ are of degree $n$ and have the same zeros.
Using (5.3.12), if $f(x)$ is continuously differentiable, then

$$\begin{aligned}
\int_a^b w(x) f(x)\,dx &= \int_a^b w(x) H_n(x)\,dx + \int_a^b w(x)\,\mathscr{E}_n(x)\,dx \\
  &= I_n(f) + E_n(f)
\end{aligned} \tag{5.3.15}$$

The degree of precision is at least $2n - 1$, since $\mathscr{E}_n(x) = 0$ if $f(x)$ is a
polynomial of degree $< 2n$, from (5.3.13). Also from (5.3.13),

$$E_n(x^{2n}) = \int_a^b w(x)\,\mathscr{E}_n(x)\,dx = \int_a^b w(x)\,[\psi_n(x)]^2\,dx > 0 \tag{5.3.16}$$

Thus the degree of precision of $I_n(f)$ is exactly $2n - 1$.
To derive a simpler formula for $I_n(f)$,

$$I_n(f) = \sum_{j=1}^{n} f(x_j)\int_a^b w(x) h_j(x)\,dx + \sum_{j=1}^{n} f'(x_j)\int_a^b w(x)\,\tilde{h}_j(x)\,dx \tag{5.3.17}$$

we show that all of the integrals in the second sum are zero. Recall that
from (3.6.2),

$$\tilde{h}_j(x) = (x - x_j)\,[l_j(x)]^2 \qquad l_j(x) = \frac{\psi_n(x)}{(x - x_j)\,\psi_n'(x_j)} = \frac{\varphi_n(x)}{(x - x_j)\,\varphi_n'(x_j)}$$

The last step uses (5.3.14). Thus

$$\tilde{h}_j(x) = (x - x_j)\,l_j(x)\,l_j(x) = \frac{\varphi_n(x)\,l_j(x)}{\varphi_n'(x_j)} \tag{5.3.18}$$

Since degree$(l_j) = n - 1$, and since $\varphi_n(x)$ is orthogonal to all polynomials
of degree $< n$, we have

$$\int_a^b w(x)\,\tilde{h}_j(x)\,dx = \frac{1}{\varphi_n'(x_j)}\int_a^b w(x)\,\varphi_n(x)\,l_j(x)\,dx = 0 \qquad j = 1, \ldots, n \tag{5.3.19}$$

The integration formula (5.3.15) becomes

$$\int_a^b w(x) f(x)\,dx = \sum_{j=1}^{n} w_j f(x_j) + E_n(f)$$

$$w_j = \int_a^b w(x)\,h_j(x)\,dx \qquad j = 1, \ldots, n \tag{5.3.20}$$


(b) Uniqueness of formula (5.3.20). Suppose that we have a numerical
integration formula

$$\int_a^b w(x) f(x)\,dx \approx \sum_{j=1}^{n} v_j f(z_j) \tag{5.3.21}$$

that has degree of precision $\ge 2n - 1$. Construct the Hermite interpolation
formula to $f(x)$ at the nodes $z_1, \ldots, z_n$. Then for any polynomial
$p(x)$ of degree $\le 2n - 1$,

$$p(x) = \sum_{j=1}^{n} p(z_j)\,h_j(x) + \sum_{j=1}^{n} p'(z_j)\,\tilde{h}_j(x) \qquad \deg(p) \le 2n - 1 \tag{5.3.22}$$

where $h_j(x)$ and $\tilde{h}_j(x)$ are defined using $\{z_j\}$. Multiply (5.3.22) by
$w(x)$, use the assumption on the degree of precision of (5.3.21), and
integrate to get

$$\sum_{j=1}^{n} v_j\,p(z_j) = \sum_{j=1}^{n} p(z_j)\int_a^b w(x)\,h_j(x)\,dx + \sum_{j=1}^{n} p'(z_j)\int_a^b w(x)\,\tilde{h}_j(x)\,dx \tag{5.3.23}$$

for any polynomial $p(x)$ of degree $\le 2n - 1$.
Let $p(x) = \tilde{h}_i(x)$. Use the properties (3.6.3) of $\tilde{h}_i(x)$ to obtain from
(5.3.23) that

$$0 = \int_a^b w(x)\,\tilde{h}_i(x)\,dx \qquad i = 1, \ldots, n \tag{5.3.24}$$

As before in (5.3.18), we can write

$$\tilde{h}_i(x) = (x - z_i)\,[l_i(x)]^2 = \frac{l_i(x)\,\omega_n(x)}{\omega_n'(z_i)} \qquad \omega_n(x) = (x - z_1)\cdots(x - z_n)$$

Then (5.3.24) becomes

$$\int_a^b w(x)\,\omega_n(x)\,l_i(x)\,dx = 0 \qquad i = 1, 2, \ldots, n$$

Since all polynomials of degree $\le n - 1$ can be written as a combination
of $l_1(x), \ldots, l_n(x)$, we have that $\omega_n(x)$ is orthogonal to every polynomial
of degree $\le n - 1$. Using the uniqueness of the orthogonal polynomials
[from Theorem 4.2], $\omega_n(x)$ must be a constant multiple of $\varphi_n(x)$. Thus
they must have the same zeros, and

$$z_i = x_i \qquad i = 1, \ldots, n$$

To complete the proof of the uniqueness, we must show that $w_i = v_i$,
where $v_i$ is the weight in (5.3.21) and $w_i$ in (5.3.10). Use (5.3.23) with
(5.3.24) and $p(x) = h_i(x)$. The result will follow immediately, since
$h_i(x)$ is now constructed using $\{x_i\}$.


(c) The error formula. We begin by deducing some further useful
properties about the weights $\{w_i\}$ in (5.3.10). Using the definition (3.6.2)
of $h_i(x)$,

$$\begin{aligned}
w_i = \int_a^b w(x)\,h_i(x)\,dx &= \int_a^b w(x)\left[1 - 2 l_i'(x_i)(x - x_i)\right][l_i(x)]^2\,dx \\
  &= \int_a^b w(x)\,[l_i(x)]^2\,dx - 2 l_i'(x_i)\int_a^b w(x)(x - x_i)\,[l_i(x)]^2\,dx
\end{aligned}$$

The last integral is zero from (5.3.19), since $\tilde{h}_i(x) = (x - x_i)[l_i(x)]^2$.
Thus

$$w_i = \int_a^b w(x)\,[l_i(x)]^2\,dx > 0 \qquad i = 1, 2, \ldots, n \tag{5.3.25}$$

and all the weights are positive, for all $n$.

To construct $w_i$, begin by substituting $f(x) = l_i(x)$ into (5.3.20), and
note that $E_n(f) = 0$, since degree$(l_i) = n - 1$. Then using $l_i(x_j) = \delta_{ij}$,
we have

$$w_i = \int_a^b w(x)\,l_i(x)\,dx \qquad i = 1, \ldots, n \tag{5.3.26}$$

To further simplify the formula, the Christoffel-Darboux identity
(Theorem 4.6) can be used, followed by much manipulation, to give the
formula (5.3.11). For the details, see Isaacson and Keller (1966, pp.
333-334).

For the integration error, if $f(x)$ is $2n$ times continuously differentiable
on $[a, b]$, then

$$\begin{aligned}
E_n(f) &= \int_a^b w(x)\,\mathscr{E}_n(x)\,dx \\
       &= \int_a^b w(x)\,[\psi_n(x)]^2\, f[x_1, x_1, \ldots, x_n, x_n, x]\,dx \\
       &= f[x_1, x_1, \ldots, x_n, x_n, \xi]\int_a^b w(x)\,[\psi_n(x)]^2\,dx \qquad \text{some } \xi \in [a, b]
\end{aligned}$$

the last step using the integral mean value theorem. Using (5.3.14) in the
last integral, and replacing the divided difference by a derivative, we have

$$E_n(f) = \frac{f^{(2n)}(\eta)}{(2n)!}\int_a^b w(x)\,\frac{[\varphi_n(x)]^2}{A_n^2}\,dx = \frac{\gamma_n}{A_n^2\,(2n)!} f^{(2n)}(\eta)$$

which gives the error formula in (5.3.10). ∎


Gauss-Legendre quadrature For $w(x) \equiv 1$, the Gaussian formula on $[-1, 1]$ is
given by

$$\int_{-1}^{1} f(x)\,dx \approx \sum_{j=1}^{n} w_j f(x_j) \tag{5.3.27}$$

with the nodes equal to the zeros of the degree $n$ Legendre polynomial $P_n(x)$ on
$[-1, 1]$. The weights are

$$w_i = \frac{-2}{(n + 1)\,P_n'(x_i)\,P_{n+1}(x_i)} \qquad i = 1, 2, \ldots, n \tag{5.3.28}$$

and

$$E_n(f) = \frac{2^{2n+1}(n!)^4}{(2n + 1)\,[(2n)!]^2}\cdot\frac{f^{(2n)}(\eta)}{(2n)!} \equiv e_n\,\frac{f^{(2n)}(\eta)}{(2n)!} \tag{5.3.29}$$

for some $-1 < \eta < 1$. For integrals on other finite intervals with weight function
$w(x) \equiv 1$, use the following linear change of variables:

$$\int_a^b f(t)\,dt = \frac{b - a}{2}\int_{-1}^{1} f\!\left(\frac{a + b + x(b - a)}{2}\right)dx \tag{5.3.30}$$

reducing the integral to the standard interval $[-1, 1]$.


Table 5.10 Gauss-Legendre nodes and weights

  n    x_i               w_i
  2    ±.5773502692      1.0
  3    ±.7745966692      .5555555556
        0.0              .8888888889
  4    ±.8611363116      .3478548451
       ±.3399810436      .6521451549
  5    ±.9061798459      .2369268851
       ±.5384693101      .4786286705
        0.0              .5688888889
  6    ±.9324695142      .1713244924
       ±.6612093865      .3607615730
       ±.2386191861      .4679139346
  7    ±.9491079123      .1294849662
       ±.7415311856      .2797053915
       ±.4058451514      .3818300505
        0.0              .4179591837
  8    ±.9602898565      .1012285363
       ±.7966664774      .2223810345
       ±.5255324099      .3137066459
       ±.1834346425      .3626837834


Table 5.11 Gaussian quadrature for (5.1.11)

  n    I_n                  I - I_n
  2    -12.33621046570      2.66E-1
  3    -12.12742045017      5.71E-2
  4    -12.07018949029      -1.57E-4
  5    -12.07032853589      -1.78E-5
  6    -12.07034633110      1.47E-8
  7    -12.07034631753      1.14E-9
  8    -12.07034631639      -4.25E-13

For convenience, we include Table 5.10, which gives the nodes and weights for 
formula (5.3.27) with small values of n. For larger values of n, see the very 
complete tables in Stroud and Secrest (1966), which go up to n = 512. 


Example Evaluate the integral (5.1.11),

$$I = \int_0^{\pi} e^x \cos(x)\,dx = -12.0703463164$$


which was used previously in Section 5.1 as an example for the trapezoidal rule 
(see Table 5.1) and Simpson’s rule (see Table 5.3). The results given in Table 5.11 
show the marked superiority of Gaussian quadrature. 
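The first rows of Table 5.11 can be reproduced with a few lines of code. This is a minimal sketch, assuming the nodes and weights of Table 5.10 for $n = 2, 3, 4$ are typed in as constants; the names `GAUSS_LEGENDRE` and `gauss_legendre` are illustrative. It applies (5.3.27) after the change of variables (5.3.30).

```python
import math

# Nodes and weights for n = 2, 3, 4 copied from Table 5.10. Only nonnegative
# nodes are stored; the rule is symmetric, and a node at 0.0 is used once.
GAUSS_LEGENDRE = {
    2: [(0.5773502692, 1.0)],
    3: [(0.7745966692, 0.5555555556), (0.0, 0.8888888889)],
    4: [(0.8611363116, 0.3478548451), (0.3399810436, 0.6521451549)],
}

def gauss_legendre(f, a, b, n):
    """Approximate the integral of f over [a, b] using (5.3.27) and (5.3.30)."""
    half, mid = (b - a) / 2.0, (a + b) / 2.0
    total = 0.0
    for x, w in GAUSS_LEGENDRE[n]:
        if x == 0.0:
            total += w * f(mid)
        else:
            total += w * (f(mid - half * x) + f(mid + half * x))
    return half * total

def integrand(x):
    return math.exp(x) * math.cos(x)

exact = -(math.exp(math.pi) + 1.0) / 2.0   # closed form of the integral (5.1.11)
for n in (2, 3, 4):
    approx = gauss_legendre(integrand, 0.0, math.pi, n)
    print(n, approx, exact - approx)
```

Because the tabulated nodes and weights carry ten digits, the computed values should agree with Table 5.11 only to roughly that accuracy.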


A general error result We give a useful result in trying to explain the excellent 
convergence of Gaussian quadrature. In the next subsection, we consider in more 
detail the error in Gauss—Legendre quadrature. 


Theorem 5.4 Assume $[a, b]$ is finite. Then the error in Gaussian quadrature,

$$E_n(f) = \int_a^b w(x) f(x)\,dx - \sum_{j=1}^{n} w_j f(x_j)$$

satisfies

$$|E_n(f)| \le 2\left[\int_a^b w(x)\,dx\right]\rho_{2n-1}(f) \qquad n \ge 1 \tag{5.3.31}$$

with $\rho_{2n-1}(f)$ the minimax error from (4.2.1).

Proof $E_n(p) = 0$ for any polynomial $p(x)$ of degree $\le 2n - 1$. Also, the error
function $E_n$ satisfies

$$E_n(F + G) = E_n(F) + E_n(G)$$

for all $F, G \in C[a, b]$. Let $p(x) = q^*_{2n-1}(x)$, the minimax approximation
of degree $\le 2n - 1$ to $f(x)$ on $[a, b]$. Then

$$\begin{aligned}
E_n(f) &= E_n(f) - E_n(q^*_{2n-1}) = E_n(f - q^*_{2n-1}) \\
       &= \int_a^b w(x)\left[f(x) - q^*_{2n-1}(x)\right]dx - \sum_{j=1}^{n} w_j\left[f(x_j) - q^*_{2n-1}(x_j)\right]
\end{aligned}$$

$$|E_n(f)| \le \|f - q^*_{2n-1}\|_\infty\left[\int_a^b w(x)\,dx + \sum_{j=1}^{n} |w_j|\right]$$

From (5.3.25), all $w_j > 0$. Also, since $p(x) \equiv 1$ is of degree 0,

$$\sum_{j=1}^{n} w_j = \int_a^b w(x)\,dx$$

This completes the proof of (5.3.31). ∎

From the results in Sections 4.6 and 4.7, the speed of convergence to zero of
$\rho_m(f)$ increases with the smoothness of the integrand. From (5.3.31), the same is
true of Gaussian quadrature. In contrast, the composite trapezoidal rule will
usually not converge faster than order $h^2$ [in particular, if $f'(b) - f'(a) \ne 0$],
regardless of the smoothness of $f(x)$. Gaussian quadrature takes advantage of
additional smoothness in the integrand, in contrast to most composite rules.

Example Consider using Gauss-Legendre quadrature to integrate

$$I = \int_0^1 e^{-x^2}\,dx = .746824132812 \tag{5.3.32}$$

Table 5.12 contains error bounds based on (5.3.31),

$$|E_n(f)| \le 2\rho_{2n-1}(f) \tag{5.3.33}$$

along with the true error. The error bound is of approximately the same
magnitude as the true error.


Table 5.12 Gaussian quadrature of (5.3.32)

  n    E_n(f)       (5.3.33)
  1    -3.20E-2     1.06E-1
  2    2.29E-4      1.33E-3
  3    9.55E-6      3.24E-5
  4    -3.35E-7     9.24E-7
  5    6.05E-9      1.61E-8


Discussion of Gauss-Legendre quadrature We begin by trying to make the
error term (5.3.29) more understandable. First define

$$M_m = \max_{-1 \le x \le 1} \frac{|f^{(m)}(x)|}{m!} \qquad m \ge 0 \tag{5.3.34}$$

For a large class of infinitely differentiable functions $f$ on $[-1, 1]$, we have
$\sup_{m \ge 0} M_m < \infty$. For example, this will be true if $f(z)$ is analytic on the
region $R$ of the complex plane defined by

$$R = \{z : |z - x| \le 1 \text{ for some } x,\ -1 \le x \le 1\}$$

With many functions, $M_m \to 0$ as $m \to \infty$, for example, $f(x) = e^x$ and $\cos(x)$.
Combining (5.3.29) and (5.3.34), we obtain

$$|E_n(f)| \le e_n M_{2n} \qquad n \ge 1 \tag{5.3.35}$$

and the size of $e_n$ is essential in examining the speed of convergence.

The term $e_n$ can be made more understandable by estimating it using Stirling's
formula,

$$n! \approx e^{-n} n^n \sqrt{2\pi n}$$

which is true in a relative error sense as $n \to \infty$. Then we obtain

$$e_n \approx \frac{\pi}{4^n} \qquad \text{as } n \to \infty \tag{5.3.36}$$

This is quite a good estimate. For example, $e_5 = .00293$, and (5.3.36) gives the
estimate .00307. Combined with (5.3.35), this implies

$$|E_n(f)| \le \frac{\pi}{4^n} M_{2n} \tag{5.3.37}$$

which is a correct bound in an asymptotic sense as $n \to \infty$. This shows that
$E_n(f) \to 0$ with an exponential rate of decrease as a function of $n$. Compare this
with the polynomial rates of $1/n^2$ and $1/n^4$ for the trapezoidal and Simpson
rules, respectively.
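The quality of the estimate (5.3.36) is easy to check directly from the definition of $e_n$ in (5.3.29). The sketch below is illustrative; the function name `e_n` is an assumption. It tabulates $e_n$ against $\pi / 4^n$ and their ratio, which should approach 1 as $n$ grows.

```python
import math

def e_n(n):
    """Error constant of (5.3.29): 2^(2n+1) (n!)^4 / ((2n+1) [(2n)!]^2)."""
    return (2 ** (2 * n + 1) * math.factorial(n) ** 4
            / ((2 * n + 1) * math.factorial(2 * n) ** 2))

for n in (1, 2, 5, 10, 20):
    estimate = math.pi / 4 ** n          # Stirling-based estimate (5.3.36)
    print(n, e_n(n), estimate, e_n(n) / estimate)
```

For $n = 5$ this gives $e_5 \approx .00293$ and the estimate $\approx .00307$, the values quoted above.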

In order to consider integrands that are not infinitely differentiable, we can use
the Peano kernel form of the error, just as in Section 5.1 for Simpson's and the
trapezoidal rules. If $f(x)$ is $r$ times differentiable on $[-1, 1]$, with $f^{(r)}(x)$
integrable on $[-1, 1]$, then

$$E_n(f) = \int_{-1}^{1} K_{n,r}(t)\,f^{(r)}(t)\,dt \qquad n > \frac{r}{2} \tag{5.3.38}$$

for an appropriate Peano kernel $K_{n,r}(t)$. The procedure for constructing $K_{n,r}(t)$
is exactly the same as with the Peano kernels (5.1.21) and (5.1.25) in Section 5.1.
From (5.3.38),

$$|E_n(f)| \le e_{n,r}\,M_r \qquad e_{n,r} = r!\int_{-1}^{1} |K_{n,r}(t)|\,dt \tag{5.3.39}$$


Table 5.13 Error constants $e_{n,r}$ for (5.3.39)


The values of $e_{n,r}$ given in Table 5.13 are taken from Stroud and Secrest (1966, pp.
152-153). The table shows that for $f$ twice continuously differentiable, Gaussian
quadrature converges at least as rapidly as the trapezoidal rule (5.1.5). Using
(5.3.39) and the table, we can construct the empirical bound

$$|E_n(f)| \le \frac{.42}{n^2}\left[\max_{-1 \le x \le 1} |f''(x)|\right] \tag{5.3.40}$$

The corresponding formula (5.1.7) for the trapezoidal rule on $[-1, 1]$ gives

$$|\text{Trapezoidal error}| \le \frac{.67}{n^2}\left[\max_{-1 \le x \le 1} |f''(x)|\right]$$

which is slightly larger than (5.3.40). In actual computation, Gaussian quadrature
appears to always be superior to the trapezoidal rule, except for the case of
periodic integrands with the integration interval an integer multiple of the period
of the integrand, as in Table 5.7. An analogous discussion, using Table 5.13 with
$e_{n,4}$, can be carried out for integrands $f(x)$ that are four times differentiable
(see Problem 20).


Example We give three further examples that are not as well behaved as the
ones in Tables 5.11 and 5.12. Consider

$$I^{(1)} = \int_0^1 \sqrt{x}\,dx = \frac{2}{3} \qquad I^{(2)} = \int_0^5 \frac{dx}{1 + (x - \pi)^2} = 2.33976628367$$

$$I^{(3)} = \int_0^{2\pi} e^{-x}\sin(50x)\,dx = .019954669278 \tag{5.3.41}$$

The values in Table 5.14 show that Gaussian quadrature is still very effective, in
spite of the bad behavior of the integrand.


Table 5.14 Gaussian quadrature examples (5.3.41)

  n      I(1) - I_n(1)    Ratio    I(2) - I_n(2)    I(3) - I_n(3)
  2      -7.22E-3                  3.50E-1          3.48E-1
  4      -1.16E-3         …        -9.19E-2         -1.04E-1
  8      -1.69E-4         …        -4.03E-3         -1.80E-2
  16     -2.30E-5         …        -6.24E-7         -3.34E-1
  32     -3.00E-6         …        -2.98E-11        1.16E-1
  64     -3.84E-7         …        *                1.53E-1
  128    -4.85E-8         …        …                6.69E-15

Compare the results for $I^{(1)}$ with those in Table 5.6 for the trapezoidal and
Simpson rules. Gaussian quadrature is converging with an error proportional to
$1/n^3$, whereas in Table 5.6, the errors converged with a rate proportional to
$1/n^{1.5}$. Consider the integral

$$I(f) = \int_0^1 x^{\alpha} f(x)\,dx \tag{5.3.42}$$

with $\alpha > -1$ and nonintegral, and $f(x)$ smooth with $f(0) \ne 0$. It has been
shown by Donaldson and Elliott (1972) that the error in Gauss-Legendre
quadrature for (5.3.42) will have the asymptotic estimate

$$E_n(f) \approx \frac{c(f, \alpha)}{n^{2 + 2\alpha}} \tag{5.3.43}$$

This agrees with $I_n^{(1)}$ in Table 5.14, using $\alpha = \tfrac{1}{2}$. Other important results on
Gauss-Legendre quadrature are also given in the Donaldson and Elliott paper.

The initial convergence of $I_n^{(2)}$ to $I^{(2)}$ is quite slow, but as $n$ increases, the
speed increases dramatically. For $n \ge 64$, $I_n^{(2)} = I^{(2)}$ within the limits of the
machine arithmetic. Also compare these results with those of Table 5.5, for
the trapezoidal and Simpson rules.

The approximations in Table 5.14 for $I^{(3)}$ are quite poor because the integrand
is so oscillatory. There are 101 zeros of the integrand in the interval of
integration. To obtain an accurate value $I_n^{(3)}$, the degree of the approximating
polynomial underlying Gaussian quadrature must be very large. With $n = 128$, $I_n^{(3)}$ is
a very accurate approximation of $I^{(3)}$.


General comments Gaussian quadrature has a number of strengths and
weaknesses.

1. Because of the form of the nodes and weights and the resulting need to use a
table, many people prefer a simpler formula, such as Simpson's rule. This
shouldn't be a problem when doing integration using a computer. Programs
should be written containing these weights and nodes for standard values of
$n$, for example, $n = 2, 4, 8, 16, \ldots, 512$ [taken from Stroud and Secrest (1966)].
In addition, there are a number of very rapid programs for calculating the
nodes and weights for a variety of commonly used weight functions. Among
the better known algorithms is that in Golub and Welsch (1969).

2. It is difficult to estimate the error, and thus we usually take

$$I - I_n \approx I_m - I_n \tag{5.3.44}$$

for some $m > n$, for example, $m = n + 2$ with well-behaved integrands, and
$m = 2n$ otherwise. This results in greater accuracy than necessary, but even
with the increased number of function evaluations, Gaussian quadrature is
still faster than most other methods.

3. The nodes for each formula $I_n$ are distinct from those of preceding formulas
$I_m$, and this results in some inefficiency. If $I_n$ is not sufficiently accurate,
based on an error estimate like (5.3.44), then we must compute a new value
of $I_n$. However, none of the previous values of the integrand can be reused,
resulting in wasted effort. This is discussed more extensively in the last part
of this section, resulting in some new methods without this drawback.
Nonetheless, in many situations, the resulting inefficiency in Gaussian
quadrature is not significant because of its rapid rate of convergence.

4. If a large class of integrals of a similar nature are to be evaluated, then
proceed as follows. Pick a few representative integrals, including some with
the worst behavior in the integrand that is likely to occur. Determine a value
of $n$ for which $I_n(f)$ will have sufficient accuracy among the representative
set. Then fix that value of $n$, and use $I_n(f)$ as the numerical integral for all
members of the original class of integrals.

5. Gaussian quadrature can handle many near-singular integrands very
effectively, as is shown in (5.3.43) for (5.3.42). But all points of singular behavior
must occur as endpoints of the integration interval. Gaussian quadrature is
very poor on an integral such as

$$\int_0^1 \sqrt{|x - .7|}\,dx$$

which contains a singular point in the interval of integration. (Most other
numerical integration methods will also perform poorly on this integral.) The
integral should be decomposed and evaluated in the form

$$\int_0^{.7} \sqrt{.7 - x}\,dx + \int_{.7}^{1} \sqrt{x - .7}\,dx$$


Extensions that reuse node points Suppose we have a quadrature formula

$$I_n(f) = \sum_{k=1}^{n} w_k f(x_k) \approx \int_a^b w(x) f(x)\,dx \tag{5.3.45}$$

We want to produce a new quadrature formula that uses the $n$ nodes $x_1, \ldots, x_n$
and $m$ new nodes $x_{n+1}, \ldots, x_{n+m}$:

$$I_{n+m}(f) = \sum_{k=1}^{n+m} v_k f(x_k) \approx \int_a^b w(x) f(x)\,dx \tag{5.3.46}$$

These $n + 2m$ unspecified parameters, namely the nodes $x_{n+1}, \ldots, x_{n+m}$ and the
weights $v_1, \ldots, v_{n+m}$, are to be chosen to give (5.3.46) as large a degree of
precision as is possible. We seek a formula of degree of precision $n + 2m - 1$.
Whether such a formula can be determined with the new nodes $x_{n+1}, \ldots, x_{n+m}$
located in $[a, b]$ is in general unknown.

In the case that (5.3.45) is a Gauss formula, Kronrod studied extensions 
(5.3.46) with m =n + 1. Such pairs of formulas give a less expensive way of 
producing an error estimate for a Gauss rule (as compared with using a Gauss 
rule with 2n + 1 node points). And the degree of precision is high enough to 
produce the kind of accuracy associated with the Gauss rules. 

A variation on the preceding theme was introduced in Patterson (1968). For
$w(x) \equiv 1$, he started with a Gauss-Legendre rule $I_{n_0}(f)$. He then produced a
sequence of formulas by repeatedly constructing formulas (5.3.46) from the
preceding member of the sequence, with $m = n + 1$. A paper by Patterson (1973)
contains an algorithm based on a sequence of rules $I_3, I_7, I_{15}, I_{31}, I_{63}, I_{127}, I_{255}$;
the formula $I_3$ is the three-point Gauss rule. Another such sequence
$\{I_{10}, I_{21}, I_{43}, I_{87}\}$ is given in Piessens et al. (1983, pp. 19, 26, 27), with $I_{10}$ the
ten-point Gauss rule. All such Patterson formulas to date have had all nodes
located inside the interval of integration and all weights positive.

The degree of precision of the Patterson rules increases with the number of
points. For the sequence $I_3, I_7, \ldots, I_{255}$ previously referred to, the respective
degrees of precision are $d = 5, 11, 23, 47, 95, 191, 383$. Since the weights are
positive, the proof of Theorem 5.4 can be repeated to show that the Patterson
rules are rapidly convergent.

A further discussion of the Patterson and Kronrod rules, including programs, 
is given in Piessens et al. (1983, pp. 15-27); they also give reference to much of 
the literature on this subject. 


Example Let (5.3.45) be the three-point Gauss rule on $[-1, 1]$:

$$I_3(f) = \frac{8}{9} f(0) + \frac{5}{9}\left[f(-\sqrt{.6}) + f(\sqrt{.6})\right] \tag{5.3.47}$$

The Kronrod rule for this is

$$I_7(f) = a_0 f(0) + a_1\left[f(-\sqrt{.6}) + f(\sqrt{.6})\right] + a_2\left[f(-\beta_1) + f(\beta_1)\right] + a_3\left[f(-\beta_2) + f(\beta_2)\right] \tag{5.3.48}$$

with $\beta_1^2$ and $\beta_2^2$ the smallest and largest roots, respectively, of

$$x^2 - \frac{10}{9}x + \frac{155}{891} = 0$$

The weights $a_0, a_1, a_2, a_3$ come from integrating over $[-1, 1]$ the Lagrange
polynomial $p_6(x)$ that interpolates $f(x)$ at the nodes $\{0, \pm\sqrt{.6}, \pm\beta_1, \pm\beta_2\}$.
Approximate values are

$$a_0 = .450916538658 \qquad a_1 = .268488089868$$
$$a_2 = .401397414776 \qquad a_3 = .104656226026$$
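The pair (5.3.47)-(5.3.48) can be tried out directly. The following sketch is illustrative only (the names `gauss3` and `kronrod7` and the test integrand $e^x$ are assumptions): it computes $\beta_1, \beta_2$ from the quadratic above, uses the quoted weights, and takes $|I_7 - I_3|$ as an estimate of the error in the Gauss value, in the spirit of (5.3.44).

```python
import math

# Weights of the Kronrod rule (5.3.48), as quoted in the text.
a0, a1, a2, a3 = 0.450916538658, 0.268488089868, 0.401397414776, 0.104656226026

# beta_1**2 and beta_2**2 are the roots of x**2 - (10/9) x + 155/891 = 0.
disc = math.sqrt((10.0 / 9.0) ** 2 - 4.0 * 155.0 / 891.0)
beta1 = math.sqrt((10.0 / 9.0 - disc) / 2.0)
beta2 = math.sqrt((10.0 / 9.0 + disc) / 2.0)
g = math.sqrt(0.6)                       # positive three-point Gauss node

def gauss3(f):
    """Three-point Gauss rule (5.3.47) on [-1, 1]."""
    return 8.0 / 9.0 * f(0.0) + 5.0 / 9.0 * (f(-g) + f(g))

def kronrod7(f):
    """Seven-point Kronrod extension (5.3.48); reuses f(0) and f(+-g)."""
    return (a0 * f(0.0) + a1 * (f(-g) + f(g))
            + a2 * (f(-beta1) + f(beta1)) + a3 * (f(-beta2) + f(beta2)))

exact = math.e - 1.0 / math.e            # integral of exp(x) over [-1, 1]
i3, i7 = gauss3(math.exp), kronrod7(math.exp)
print("beta1, beta2    :", beta1, beta2)
print("Gauss error     :", exact - i3)
print("estimate I7 - I3:", i7 - i3)
print("Kronrod error   :", exact - i7)
```

In a production code the three function values already computed for $I_3$ would be stored and reused in $I_7$; the sketch recomputes them only to keep the two rules separate.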
5.4 Asymptotic Error Formulas and Their Applications


Recall the definition (5.1.10) of an asymptotic error formula for a numerical
integration formula: $\tilde{E}_n(f)$ is an asymptotic error formula for $E_n(f) =
I(f) - I_n(f)$ if

$$\lim_{n \to \infty} \frac{\tilde{E}_n(f)}{E_n(f)} = 1 \tag{5.4.1}$$

or equivalently,

$$\lim_{n \to \infty} \frac{E_n(f) - \tilde{E}_n(f)}{E_n(f)} = 0$$

Examples are (5.1.9) and (5.1.18) from Section 5.1.

By obtaining an asymptotic error formula, we are obtaining the form or 
structure of the error. With this information, we can either estimate the error in 
I,(f), as in Tables 5.1 and 5.3, or we can develop a new and more accurate 
formula, as with the corrected trapezoidal rule in (5.1.12). Both of these alterna- 
tives are further illustrated in this section, concluding with the rapidly convergent 
Romberg integration method. We begin with a further development of
asymptotic error formulas.


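The Bernoulli polynomials For use in the next theorem, we introduce the
Bernoulli polynomials $B_n(x)$, $n \ge 0$. These are defined implicitly by the generating
function

$$\frac{t\,(e^{xt} - 1)}{e^t - 1} = \sum_{j=1}^{\infty} B_j(x)\,\frac{t^j}{j!} \tag{5.4.2}$$

The first few polynomials are

$$B_0(x) = 1 \qquad B_1(x) = x \qquad B_2(x) = x^2 - x$$

$$B_3(x) = x^3 - \frac{3x^2}{2} + \frac{x}{2} \qquad B_4(x) = x^2(1 - x)^2 \tag{5.4.3}$$

With these polynomials,

$$B_k(0) = 0 \qquad k \ge 1 \tag{5.4.4}$$

There are easily computable recursion relations for calculating these polynomials
(see Problem 23).
Also of interest are the Bernoulli numbers, defined implicitly by

$$\frac{t}{e^t - 1} = \sum_{j=0}^{\infty} B_j\,\frac{t^j}{j!} \tag{5.4.5}$$

The first few numbers are

$$B_0 = 1 \quad B_1 = -\frac{1}{2} \quad B_2 = \frac{1}{6} \quad B_4 = -\frac{1}{30} \quad B_6 = \frac{1}{42} \quad B_8 = -\frac{1}{30} \tag{5.4.6}$$

and for all odd integers $j \ge 3$, $B_j = 0$. To obtain a relation to the Bernoulli
polynomials $B_j(x)$, integrate (5.4.2) with respect to $x$ on $[0, 1]$. Then

$$1 - \frac{t}{e^t - 1} = \sum_{j=1}^{\infty} \frac{t^j}{j!}\int_0^1 B_j(x)\,dx$$

and thus

$$B_j = -\int_0^1 B_j(x)\,dx \qquad j \ge 1 \tag{5.4.7}$$

We will also need to define a periodic extension of $B_j(x)$,

$$\bar{B}_j(x) = \begin{cases} B_j(x) & 0 \le x \le 1 \\ \bar{B}_j(x - 1) & x \ge 1 \end{cases} \tag{5.4.8}$$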

The Euler—MacLaurin formula The following theorem gives a very detailed 
asymptotic error formula for the trapezoidal rule. This theorem is at the heart of 
much of the asymptotic error analysis of this section. The connection with some 
other integration formulas appears later in the section. 


Theorem 5.5 (Euler-MacLaurin Formula)   Let $m \ge 0$, $n \ge 1$, and define
    $h = (b - a)/n$, $x_j = a + jh$ for $j = 0, 1, \ldots, n$. Further assume $f(x)$
    is $2m + 2$ times continuously differentiable on $[a, b]$. Then for the
    error in the trapezoidal rule,

$$E_n(f) = \int_a^b f(x)\,dx - h \sum_{j=0}^{n}{}'' f(x_j)$$

$$= -\sum_{i=1}^{m} \frac{B_{2i}}{(2i)!}\,h^{2i} \left[ f^{(2i-1)}(b) - f^{(2i-1)}(a) \right] + \frac{h^{2m+2}}{(2m+2)!} \int_a^b \bar{B}_{2m+2}\!\left(\frac{x - a}{h}\right) f^{(2m+2)}(x)\,dx \tag{5.4.9}$$

    Note: The double prime notation on the summation sign means
    that the first and last terms are to be halved before summing.


Proof   A complete proof is given in Ralston (1965, pp. 131-133), and a more
    general development is sketched in Lyness and Puri (1973, sec. 2). The
    proof in Ralston is short and correct, making full use of the special
    properties of the Bernoulli polynomials. We give a simpler, but less
    general, version of that proof, showing it to be based on integration by
    parts with a bit of clever algebraic manipulation.

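    The proof of (5.4.9) for general $n > 1$ is based on first proving the
    result for $n = 1$. Thus we concentrate on

$$E_1(f) = \int_0^h f(x)\,dx - \frac{h}{2}\left[f(0) + f(h)\right] = \frac{1}{2}\int_0^h f''(x)\,x(x - h)\,dx \tag{5.4.10}$$

    the latter formula coming from (5.1.21). Since we know the asymptotic
    formula

$$E_1(f) \approx -\frac{h^2}{12}\left[f'(h) - f'(0)\right]$$

    we attempt to manipulate (5.4.10) to obtain this. Write

$$E_1(f) = \int_0^h f''(x)\left[-\frac{h^2}{12}\right]dx + \int_0^h f''(x)\left[\frac{x(x - h)}{2} + \frac{h^2}{12}\right]dx$$

    Then

$$E_1(f) = -\frac{h^2}{12}\left[f'(h) - f'(0)\right] + \int_0^h f''(x)\left[\frac{x^2}{2} - \frac{xh}{2} + \frac{h^2}{12}\right]dx$$

    Using integration by parts,

$$E_1(f) = -\frac{h^2}{12}\left[f'(h) - f'(0)\right] + \left[f''(x)\left(\frac{x^3}{6} - \frac{x^2 h}{4} + \frac{h^2 x}{12}\right)\right]_0^h - \int_0^h f'''(x)\left[\frac{x^3}{6} - \frac{x^2 h}{4} + \frac{h^2 x}{12}\right]dx$$

    The evaluation of the quantity in brackets at $x = 0$ and $x = h$ gives
    zero. Integrate by parts again; the parts outside the integral will again be
    zero. The result will be

$$E_1(f) = -\frac{h^2}{12}\left[f'(h) - f'(0)\right] + \frac{1}{24}\int_0^h f^{(4)}(x)\,x^2(x - h)^2\,dx \tag{5.4.11}$$

    which is (5.4.9) with $m = 1$. To obtain the $m = 2$ case, first note that

$$\frac{1}{24}\int_0^h x^2(x - h)^2\,dx = \frac{h^5}{720}$$

    Then as before, write

$$\frac{1}{24}\int_0^h f^{(4)}(x)\,x^2(x - h)^2\,dx = \int_0^h f^{(4)}(x)\left[\frac{h^4}{720}\right]dx + \int_0^h f^{(4)}(x)\left[\frac{x^2(x - h)^2}{24} - \frac{h^4}{720}\right]dx$$

$$= \frac{h^4}{720}\left[f'''(h) - f'''(0)\right] + \int_0^h f^{(4)}(x)\left[\frac{x^4 - 2x^3 h + x^2 h^2}{24} - \frac{h^4}{720}\right]dx$$

    Integrate by parts twice to obtain the $m = 2$ case of (5.4.9). This can be
    continued indefinitely. The proof in Ralston uses integration by parts,
    taking advantage of special relations for the Bernoulli polynomials (see
    Problem 23).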

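    For the proof for general $n > 1$, write

$$E_n(f) = \sum_{j=1}^{n}\left\{\int_{x_{j-1}}^{x_j} f(x)\,dx - \frac{h}{2}\left[f(x_{j-1}) + f(x_j)\right]\right\}$$

    For the $m = 1$ case, using (5.4.11),

$$E_n(f) = \sum_{j=1}^{n}\left\{-\frac{h^2}{12}\left[f'(x_j) - f'(x_{j-1})\right]\right\} + \sum_{j=1}^{n}\frac{1}{24}\int_{x_{j-1}}^{x_j} f^{(4)}(x)\,(x - x_{j-1})^2 (x - x_j)^2\,dx$$

$$= -\frac{h^2}{12}\left[f'(b) - f'(a)\right] + \frac{h^4}{24}\int_a^b f^{(4)}(x)\,\bar{B}_4\!\left(\frac{x - a}{h}\right)dx \tag{5.4.12}$$

    The proof for $m > 1$ is essentially the same.   ■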
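The expansion (5.4.9) is easy to check numerically. The sketch below is an illustration under stated assumptions, not part of the proof: it takes $f(x) = e^x$ on $[0, 1]$ (chosen here because every derivative difference $f^{(k)}(b) - f^{(k)}(a)$ equals $e - 1$) and compares the trapezoidal error with the $h^2$ and $h^4$ terms of the formula. The function names are hypothetical.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    h = (b - a) / n
    inner = sum(f(a + j * h) for j in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

def euler_maclaurin_terms(n):
    """Trapezoidal error for exp on [0, 1] and the first two terms of (5.4.9)."""
    a, b = 0.0, 1.0
    h = (b - a) / n
    error = (math.e - 1.0) - trapezoid(math.exp, a, b, n)
    # For f = exp, f'(b) - f'(a) = f'''(b) - f'''(a) = e - 1.
    diff = math.exp(b) - math.exp(a)
    term_h2 = -(1.0 / 6.0) / 2.0 * h**2 * diff       # -(B_2/2!) h^2 [f'(b) - f'(a)]
    term_h4 = -(-1.0 / 30.0) / 24.0 * h**4 * diff    # -(B_4/4!) h^4 [f'''(b) - f'''(a)]
    return error, term_h2, term_h4

error, term_h2, term_h4 = euler_maclaurin_terms(8)
# Remainders after removing one and then two terms of the expansion.
remainder_after_h2 = error - term_h2
remainder_after_h4 = error - term_h2 - term_h4
```

Each term removed should reduce the remainder by roughly a factor of $h^2$, which is the behavior the theorem predicts for a smooth integrand.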


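The error term in (5.4.9) can be simplified using the integral mean value
theorem. It can be shown that

$$(-1)^{j} B_{2j}(x) \ge 0 \qquad 0 \le x \le 1 \tag{5.4.13}$$

and consequently the integral in the error term satisfies

$$\int_a^b \bar{B}_{2m+2}\!\left(\frac{x - a}{h}\right) f^{(2m+2)}(x)\,dx = f^{(2m+2)}(\xi) \int_a^b \bar{B}_{2m+2}\!\left(\frac{x - a}{h}\right)dx$$

$$= f^{(2m+2)}(\xi) \sum_{j=1}^{n} \int_{x_{j-1}}^{x_j} \bar{B}_{2m+2}\!\left(\frac{x - a}{h}\right)dx$$

$$= nh\,f^{(2m+2)}(\xi) \int_0^1 B_{2m+2}(u)\,du$$

$$= -(b - a)\,B_{2m+2}\,f^{(2m+2)}(\xi) \qquad \text{some } a \le \xi \le b$$

Thus (5.4.9) becomes

$$E_n(f) = -\sum_{i=1}^{m} \frac{B_{2i}}{(2i)!}\,h^{2i}\left[f^{(2i-1)}(b) - f^{(2i-1)}(a)\right] - \frac{h^{2m+2}(b - a)\,B_{2m+2}}{(2m+2)!}\,f^{(2m+2)}(\xi) \tag{5.4.14}$$

for some $a \le \xi \le b$.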
As an important corollary of (5.4.9), we can show that the trapezoidal rule 
performs especially well when applied to periodic functions. 


Corollary 1   Suppose $f(x)$ is infinitely differentiable for $a \le x \le b$, and
    suppose that all of its odd ordered derivatives are periodic with $b - a$ an
    integer multiple of the period. Then the order of convergence of the
    trapezoidal rule $I_n(f)$ applied to

$$I(f) = \int_a^b f(x)\,dx$$

    is greater than any power of $h$.

Proof   Directly from the assumptions on $f(x)$,

$$f^{(2j-1)}(b) = f^{(2j-1)}(a) \qquad j \ge 1 \tag{5.4.15}$$

    Consequently for any $m \ge 0$, with $h = (b - a)/n$, (5.4.14) implies

$$I(f) - I_n(f) = -\frac{h^{2m+2}(b - a)}{(2m+2)!}\,B_{2m+2}\,f^{(2m+2)}(\xi) \qquad a \le \xi \le b \tag{5.4.16}$$

    Thus as $n \to \infty$ (and $h \to 0$) the rate of convergence is proportional to
    $h^{2m+2}$. But $m$ was arbitrary, which shows the desired result.   ■


This result is illustrated in Table 5.7 for f(x) = exp(cos(x)). The trapezoidal
rule is often the best numerical integration rule for smooth periodic integrands of
the type specified in the preceding corollary. For a comparison of the
Gauss-Legendre formula and the trapezoidal rule for a one-parameter family of
periodic functions, of varying behavior, see Donaldson and Elliott (1972, p. 592).
They suggest that the trapezoidal rule is superior, even for very peaked
integrands. This conclusion improves on an earlier analysis that seemed to indicate
that Gaussian quadrature was superior for peaked integrands.
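The behavior described in Corollary 1 can be observed directly. The sketch below assumes the interval of integration is one full period, $[0, 2\pi]$, and uses the known closed form $\int_0^{2\pi} e^{\cos x}\,dx = 2\pi I_0(1)$, with the modified Bessel function evaluated from its power series $I_0(1) = \sum_k (1/4)^k/(k!)^2$. The function names are hypothetical.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    h = (b - a) / n
    inner = sum(f(a + j * h) for j in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

def integrand(x):
    return math.exp(math.cos(x))

def exact_value(terms=30):
    """2*pi*I_0(1), with I_0(1) summed from its power series."""
    i0 = sum(0.25**k / math.factorial(k) ** 2 for k in range(terms))
    return 2.0 * math.pi * i0

def periodic_errors(ns=(2, 4, 8, 16)):
    """Map n to the error I - I_n for the periodic integrand above."""
    exact = exact_value()
    return {n: exact - trapezoid(integrand, 0.0, 2.0 * math.pi, n) for n in ns}

errors = periodic_errors()
```

Doubling $n$ here does far more than divide the error by 4, which is what a fixed $O(h^2)$ rate would give; the error falls to rounding level by $n = 16$.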


The Euler—MacLaurin summation formula Although it doesn’t involve numeri- 
cal integration, an important application of (5.4.9) or (5.4.14) is to the summation 
of series. 


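Corollary 2 (Euler-MacLaurin summation formula)   Assume $f(x)$ is $2m + 2$
    times continuously differentiable for $0 \le x < \infty$, for some $m \ge 0$.
    Then for all $n \ge 1$,

$$\sum_{j=0}^{n} f(j) = \int_0^n f(x)\,dx + \frac{1}{2}\left[f(0) + f(n)\right] + \sum_{i=1}^{m} \frac{B_{2i}}{(2i)!}\left[f^{(2i-1)}(n) - f^{(2i-1)}(0)\right] - \frac{1}{(2m+2)!}\int_0^n \bar{B}_{2m+2}(x)\,f^{(2m+2)}(x)\,dx \tag{5.4.17}$$

Proof   Merely substitute $a = 0$, $b = n$ into (5.4.9), and then rearrange the terms
    appropriately.   ■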


Example   In a later example we need the sum

$$S = \sum_{n=1}^{\infty} \frac{1}{n^{3/2}} \tag{5.4.18}$$

If we use the obvious choice $f(x) = (x + 1)^{-3/2}$ in (5.4.17), the results are
disappointing. By letting $n \to \infty$, we obtain

$$S = \int_0^{\infty} \frac{dx}{(x + 1)^{3/2}} + \frac{1}{2} - \sum_{i=1}^{m} \frac{B_{2i}}{(2i)!}\,f^{(2i-1)}(0) - \frac{1}{(2m+2)!}\int_0^{\infty} \bar{B}_{2m+2}(x)\,f^{(2m+2)}(x)\,dx$$

and the error term does not become small for any choice of $m$. But if we divide
the series $S$ into two parts, we are able to treat it very accurately.
Let $f(x) = (x + 10)^{-3/2}$. Then with $m = 1$,

$$\sum_{n=10}^{\infty} \frac{1}{n^{3/2}} = \sum_{j=0}^{\infty} f(j) = \int_0^{\infty} \frac{dx}{(x + 10)^{3/2}} + \frac{1}{2(10)^{3/2}} + \frac{1}{12}\cdot\frac{3}{2(10)^{5/2}} + E$$

$$E = -\frac{1}{24}\int_0^{\infty} \bar{B}_4(x)\,f^{(4)}(x)\,dx$$

Since $\bar{B}_4(x) \ge 0$, $f^{(4)}(x) > 0$, we have $E < 0$. Also

$$0 < -E < \frac{1}{24}\int_0^{\infty} \left(\frac{1}{16}\right) f^{(4)}(x)\,dx = \frac{35}{(1024)(10)^{9/2}} = 1.08 \times 10^{-6}$$

Thus

$$\sum_{n=10}^{\infty} \frac{1}{n^{3/2}} = .648662205 + E$$

By directly summing $\sum_{n=1}^{9} (1/n^{3/2}) = 1.963713717$, we obtain

$$\sum_{n=1}^{\infty} \frac{1}{n^{3/2}} = 2.6123759 + E \qquad 0 < -E < 1.08 \times 10^{-6} \tag{5.4.19}$$

See Ralston (1965, pp. 134-138) for more information on summation techniques.
To appreciate the importance of the preceding summation method, note that adding
the terms of $S$ directly would have required about $3.43 \times 10^{12}$ terms to
obtain comparable accuracy.
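The arithmetic of this example can be reproduced in a few lines. The following is a sketch under the example's own choices (split after nine terms, $m = 1$, remainder $E$ dropped); the parameter name `split` and the function names are hypothetical.

```python
import math

def zeta_three_halves(split=10):
    """Approximate sum_{n>=1} n^(-3/2) as in the example, with m = 1.

    The first split-1 terms are added directly; the tail uses (5.4.17)
    applied to f(x) = (x + split)^(-3/2), with the remainder E dropped.
    """
    head = sum(n ** -1.5 for n in range(1, split))
    integral = 2.0 / math.sqrt(split)                 # integral of f over [0, infinity)
    endpoint = 0.5 * split ** -1.5                    # f(0)/2
    derivative = (1.0 / 12.0) * 1.5 * split ** -2.5   # -(B_2/2!) f'(0)
    tail = integral + endpoint + derivative
    return head, tail

def remainder_bound(split=10):
    """Bound on -E using max B_4(x) = 1/16 on [0, 1]: -(1/384) f'''(0)."""
    return (105.0 / 8.0) * split ** -4.5 / 384.0

head, tail = zeta_three_halves()
approximation = head + tail     # the value 2.6123759 of (5.4.19), before adding E
bound = remainder_bound()
```

Increasing `split` makes the bound shrink like `split ** -4.5`, so a modest amount of direct summation buys a great deal of accuracy from the same one-term correction.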


A generalized Euler-MacLaurin formula   For integrals in which the integrand is
not differentiable at some points, it is still often possible to obtain an asymptotic
error expansion. For the trapezoidal rule and other numerical integration rules
applied to integrands with algebraic and/or logarithmic singularities, see the
article Lyness and Ninham (1967). In the following paragraphs, we specialize
their results to the integral

$$I = \int_0^1 x^{\alpha} f(x)\,dx \qquad \alpha > 0 \tag{5.4.20}$$

with $f \in C^{m+1}[0, 1]$, using the trapezoidal numerical integration rule.
Assuming $\alpha$ is not an integer,

$$E_n(f) = \sum_{j=0}^{m-1} \frac{c_j}{n^{\alpha+j+1}} + \sum_{j=1}^{m-1} \frac{d_j}{n^{j+1}} + O\!\left(\frac{1}{n^{m+1}}\right) \tag{5.4.21}$$

The term $O(1/n^{m+1})$ denotes a quantity whose size is proportional to $1/n^{m+1}$, or
possibly smaller. The constants are given by

$$c_j = \frac{2\,\Gamma(\alpha + j + 1)\,\sin\!\left[(\pi/2)(\alpha + j)\right]\,\zeta(\alpha + j + 1)}{(2\pi)^{\alpha+j+1}\,j!}\,f^{(j)}(0)$$

$$d_j = 0 \qquad \text{for } j \text{ even}$$

$$d_j = (-1)^{(j+1)/2}\,\frac{2\,\zeta(j + 1)}{(2\pi)^{j+1}}\,g^{(j)}(1) \qquad j \text{ odd}$$

with $g(x) = x^{\alpha} f(x)$, $\Gamma(x)$ the gamma function, and $\zeta(p)$ the zeta function,

$$\zeta(p) = \sum_{j=1}^{\infty} \frac{1}{j^{p}} \qquad p > 1 \tag{5.4.22}$$


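For $0 < \alpha < 1$ with $m = 1$, we obtain the asymptotic error estimate

$$E_n(f) = \frac{2\,\Gamma(\alpha + 1)\,\sin\!\left[(\pi/2)\alpha\right]\,\zeta(\alpha + 1)\,f(0)}{(2\pi)^{\alpha+1}\,n^{\alpha+1}} + O\!\left(\frac{1}{n^2}\right) \qquad f \in C^2[0, 1] \tag{5.4.23}$$

For example with $I = \int_0^1 \sqrt{x}\,f(x)\,dx$, and using (5.4.19) for evaluating $\zeta(3/2)$,

$$E_n(f) = \frac{c\,f(0)}{n^{3/2}} + O\!\left(\frac{1}{n^2}\right) \qquad c = \frac{\zeta(3/2)}{4\pi} = .208 \tag{5.4.24}$$

This is confirmed numerically in the example given in Table 5.6 in Section 5.1.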
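The constant in (5.4.24) can also be observed with a short experiment. The sketch below assumes $f(x) \equiv 1$, so that $I = \int_0^1 \sqrt{x}\,dx = 2/3$, and takes $\zeta(3/2) \approx 2.6123759$ from (5.4.19); the function names are hypothetical.

```python
import math

ZETA_3_2 = 2.6123759   # value from (5.4.19); its error is below 1.08e-6

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    h = (b - a) / n
    inner = sum(f(a + j * h) for j in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

def scaled_sqrt_error(n):
    """n^(3/2) * (I - I_n) for I = integral of sqrt(x) over [0, 1]."""
    error = 2.0 / 3.0 - trapezoid(math.sqrt, 0.0, 1.0, n)
    return n ** 1.5 * error

c = ZETA_3_2 / (4.0 * math.pi)     # leading constant of (5.4.24) with f(0) = 1
scaled = scaled_sqrt_error(4096)   # should be close to c for large n
```

The scaled error approaches $c$ only at the rate $O(1/\sqrt{n})$, because the next term of the expansion is $O(1/n^2)$; this is why a fairly large $n$ is used.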
For logarithmic endpoint singularities, the results of Lyness and Ninham
(1967) imply an asymptotic error formula

$$E_n(f) \approx \frac{c \cdot \log(n)}{n^{p}} \tag{5.4.25}$$

for some $p > 0$ and some constant $c$. For numerical purposes, this is essentially
$O(1/n^{p})$. To justify this, calculate the following limit using L'Hospital's rule,
with $p > q$:

$$\lim_{n \to \infty} \frac{\log(n)/n^{p}}{1/n^{q}} = \lim_{n \to \infty} \frac{\log(n)}{n^{p-q}} = 0$$

This means that $\log(n)/n^{p}$ decreases more rapidly than $1/n^{q}$ for any $q < p$.
And it clearly decreases less rapidly than $1/n^{p}$, although not by much.

For practical computation, (5.4.25) is essentially $O(1/n^{p})$. For example,
calculate the limit of successive errors:

$$\lim_{n \to \infty} \frac{I - I_n}{I - I_{2n}} = \lim_{n \to \infty} \frac{c \cdot \log(n)/n^{p}}{c \cdot \log(2n)/2^{p} n^{p}} = \lim_{n \to \infty} 2^{p} \cdot \frac{\log(n)}{\log(2n)}$$

$$= 2^{p} \lim_{n \to \infty} \frac{1}{1 + (\log 2/\log n)} = 2^{p}$$

This is the same limiting ratio as would occur if the error were just $O(1/n^{p})$.


Aitken extrapolation   Motivated by the preceding, we assume that the
integration formula has an asymptotic error formula

$$I - I_n \approx \frac{c}{n^{p}} \qquad p > 0 \tag{5.4.26}$$

This is not always valid. For example, Gaussian quadrature does not usually
satisfy (5.4.26), and the trapezoidal rule applied to periodic integrands does not
satisfy it. Nonetheless, many numerical integration rules do satisfy it, for a wide
variety of integrands. Using this assumed form for the error, we attempt to
estimate the error. An analogue of this work is that on Aitken extrapolation in
Section 2.6.
First we estimate $p$. Using (5.4.26),

$$\frac{I_{2n} - I_n}{I_{4n} - I_{2n}} = \frac{(I - I_n) - (I - I_{2n})}{(I - I_{2n}) - (I - I_{4n})} \approx \frac{(c/n^{p}) - (c/2^{p} n^{p})}{(c/2^{p} n^{p}) - (c/4^{p} n^{p})} = 2^{p}$$

This gives a simple way of computing $p$.


Example   Consider the use of Simpson's rule with $\int_0^1 x\sqrt{x}\,dx = 0.4$. In Table
5.15, column $R_n$, the ratio $(I_{n/2} - I_{n/4})/(I_n - I_{n/2})$ of successive
differences, should approach $2^{2.5} = 5.66$, a theoretical result from Lyness
and Ninham (1967) for the order of convergence. Clearly the numerical results
confirm the theory.

Table 5.15   Simpson integration errors for $\int_0^1 x\sqrt{x}\,dx$

  n     I_n              I - I_n       I_n - I_{n/2}    R_n
  2     .402368927062    -2.369E-3
  4     .400431916045    -4.319E-4     -1.937E-3
  8     .400077249447    -7.725E-5     -3.547E-4        5.46
  16    .400013713469    -1.371E-5     -6.354E-5        5.58
  32    .400002427846    -2.428E-6     -1.129E-5        5.63
  64    .400000429413    -4.294E-7     -1.998E-6        5.65
  128   .400000075924    -7.592E-8     -3.535E-7        5.65
  256   .400000013423    -1.342E-8     -6.250E-8        5.66
  512   .400000002373    -2.373E-9     -1.105E-8        5.66


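To estimate the integral $I$ with increased accuracy, suppose that $I_n$, $I_{2n}$, and
$I_{4n}$ have been computed. Using (5.4.26),

$$\frac{I - I_n}{I - I_{2n}} \approx 2^{p} \approx \frac{I - I_{2n}}{I - I_{4n}}$$

and thus

$$(I - I_n)(I - I_{4n}) \approx (I - I_{2n})^2$$

Solving for $I$, and manipulating to obtain a desirable form,

$$I \approx \tilde{I}_{4n} = I_{4n} - \frac{(I_{4n} - I_{2n})^2}{(I_{4n} - I_{2n}) - (I_{2n} - I_n)} \tag{5.4.28}$$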


Example   Using the previous example for $f(x) = x\sqrt{x}$ and Table 5.15, we
obtain the difference table in Table 5.16. Then

$$\tilde{I}_{64} = .399999999387$$

$$I - \tilde{I}_{64} = 6.13 \times 10^{-10} \qquad I - I_{64} = -4.29 \times 10^{-7}$$

Thus $\tilde{I}_{64}$ is a considerable improvement on $I_{64}$. Also note that
$\tilde{I}_{64} - I_{64}$ is an excellent approximation to $I - I_{64}$.


Summing up, given a numerical integration rule satisfying (5.4.26) and given
three values $I_n$, $I_{2n}$, $I_{4n}$, calculate the Aitken extrapolate $\tilde{I}_{4n}$ of
(5.4.28). It is usually a significant improvement on $I_{4n}$ as an approximation
to $I$; and based on this,

$$I - I_{4n} \approx \tilde{I}_{4n} - I_{4n} \tag{5.4.29}$$
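The whole procedure (estimate $p$, form the Aitken extrapolate, estimate the error) can be sketched as follows, using the integrand $x\sqrt{x}$ of the preceding example with $n = 16$. The function names are hypothetical, and the simple Simpson routine shown here recomputes function values rather than reusing them.

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    odd = sum(f(a + (2 * j - 1) * h) for j in range(1, n // 2 + 1))
    even = sum(f(a + 2 * j * h) for j in range(1, n // 2))
    return h / 3.0 * (f(a) + 4.0 * odd + 2.0 * even + f(b))

def aitken(i_n, i_2n, i_4n):
    """Return (estimate of p, Aitken extrapolate) from I_n, I_2n, I_4n."""
    d1 = i_2n - i_n
    d2 = i_4n - i_2n
    p = math.log2(d1 / d2)                       # ratio of differences tends to 2^p
    extrapolate = i_4n - d2 * d2 / (d2 - d1)     # formula (5.4.28)
    return p, extrapolate

def f(x):
    return x * math.sqrt(x)

values = [simpson(f, 0.0, 1.0, n) for n in (16, 32, 64)]
p_est, i_tilde = aitken(*values)
error_estimate = i_tilde - values[-1]            # estimate of I - I_64, as in (5.4.29)
```

The formula divides by a second difference, so it should only be applied while the successive differences are still well above rounding level.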

Table 5.16   Difference table for Simpson integration

  n     I_n              ΔI_n = I_{2n} - I_n    Δ²I_n
  16    .400013713469
                         -1.1285623E-5
  32    .400002427846                           9.28719E-6
                         -1.998433E-6
  64    .400000429413

With Simpson's rule, or any other composite closed Newton-Cotes formula,
the expense of evaluating $I_n$, $I_{2n}$, $I_{4n}$ is no more than that of $I_{4n}$ alone,
namely $4n + 1$ function evaluations. And when the algorithm is designed
correctly, there is no need for temporary storage of large numbers of function
values $f(x_j)$. For that reason, one should never use Simpson's rule with just one
value of the index $n$. With no extra expenditure of time, and with only a slightly
more complicated algorithm, an Aitken extrapolate and an error estimate can be
produced.


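Richardson extrapolation   If we assume sufficient smoothness for the integrand
$f(x)$ in our integral $I(f)$, then we can write the trapezoidal error term (5.4.9) as

$$I - I_n = \frac{d_2^{(0)}}{n^2} + \frac{d_4^{(0)}}{n^4} + \cdots + \frac{d_{2m}^{(0)}}{n^{2m}} + F_{n,m} \tag{5.4.30}$$

where $I_n$ denotes the trapezoidal rule, and

$$F_{n,m} = \frac{(b - a)^{2m+2}}{(2m+2)!\,n^{2m+2}} \int_a^b \bar{B}_{2m+2}\!\left(\frac{x - a}{h}\right) f^{(2m+2)}(x)\,dx$$

$$d_{2j}^{(0)} = -\frac{B_{2j}}{(2j)!}\,(b - a)^{2j}\left[f^{(2j-1)}(b) - f^{(2j-1)}(a)\right] \tag{5.4.31}$$

Although the series dealt with are always finite and have an error term, we will
usually not directly concern ourselves with it.
For $n$ even,

$$I - I_{n/2} = \frac{4 d_2^{(0)}}{n^2} + \frac{16 d_4^{(0)}}{n^4} + \frac{64 d_6^{(0)}}{n^6} + \cdots \tag{5.4.32}$$

Multiply (5.4.30) by 4 and subtract from it (5.4.32):

$$4(I - I_n) - (I - I_{n/2}) = -\frac{12 d_4^{(0)}}{n^4} - \frac{60 d_6^{(0)}}{n^6} - \cdots$$

$$I = \frac{4 I_n - I_{n/2}}{3} - \frac{4 d_4^{(0)}}{n^4} - \frac{20 d_6^{(0)}}{n^6} - \cdots$$

Define

$$I_n^{(1)} = \frac{1}{3}\left[4 I_n^{(0)} - I_{n/2}^{(0)}\right] \qquad n \text{ even} \quad n \ge 2 \tag{5.4.33}$$

and $I_n^{(0)} = I_n$. We call $\{I_n^{(1)}\}$ the Richardson extrapolate of $\{I_n^{(0)}\}$.
The sequence

$$I_2^{(1)},\ I_4^{(1)},\ I_6^{(1)},\ \ldots$$

is a new numerical integration rule. For the error,

$$I - I_n^{(1)} = \frac{d_4^{(1)}}{n^4} + \frac{d_6^{(1)}}{n^6} + \cdots \tag{5.4.34}$$

$$d_4^{(1)} = -4 d_4^{(0)}, \quad d_6^{(1)} = -20 d_6^{(0)}, \ \ldots \tag{5.4.35}$$

To see the explicit formula for $I_n^{(1)}$, let $h = (b - a)/n$ and $x_j = a + jh$ for
$j = 0, 1, \ldots, n$. Then using (5.4.33) and the definition of the trapezoidal rule,
with $f_j = f(x_j)$,

$$I_n^{(1)} = \frac{4}{3} I_n - \frac{1}{3} I_{n/2}$$

$$= \frac{4h}{3}\left[\frac{1}{2} f_0 + f_1 + f_2 + f_3 + \cdots + f_{n-1} + \frac{1}{2} f_n\right] - \frac{2h}{3}\left[\frac{1}{2} f_0 + f_2 + f_4 + \cdots + f_{n-2} + \frac{1}{2} f_n\right]$$

$$I_n^{(1)} = \frac{h}{3}\left[f_0 + 4 f_1 + 2 f_2 + 4 f_3 + \cdots + 2 f_{n-2} + 4 f_{n-1} + f_n\right] \tag{5.4.36}$$

which is Simpson's rule with $n$ subdivisions. For the error, using (5.4.35) and
(5.4.31),

$$I - I_n^{(1)} = -\frac{h^4}{180}\left[f^{(3)}(b) - f^{(3)}(a)\right] + \cdots \tag{5.4.37}$$

This means that the work on the Euler-MacLaurin formula transfers to Simpson's
rule by means of some simple algebraic manipulations. We omit any numerical
examples since they would just be Simpson's rule, due to (5.4.36).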

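The preceding argument, which led to $I_n^{(1)}$, can be continued to produce other
new formulas. As before, if $n$ is a multiple of 4, then

$$I - I_{n/2}^{(1)} = \frac{16 d_4^{(1)}}{n^4} + \frac{64 d_6^{(1)}}{n^6} + \cdots$$

$$16\left(I - I_n^{(1)}\right) - \left(I - I_{n/2}^{(1)}\right) = -\frac{48 d_6^{(1)}}{n^6} + \cdots$$

$$I = \frac{16 I_n^{(1)} - I_{n/2}^{(1)}}{15} - \frac{48 d_6^{(1)}}{15 n^6} + \cdots \tag{5.4.38}$$

Then

$$I - I_n^{(2)} = \frac{d_6^{(2)}}{n^6} + \frac{d_8^{(2)}}{n^8} + \cdots \tag{5.4.39}$$

with

$$I_n^{(2)} = \frac{16 I_n^{(1)} - I_{n/2}^{(1)}}{15} \qquad n \ge 4 \tag{5.4.40}$$

and $n$ divisible by 4. We call $\{I_n^{(2)}\}$ the Richardson extrapolate of $\{I_n^{(1)}\}$. If we
derive the actual integration weights of $I_n^{(2)}$, in analogy with (5.4.36), we will find
that $I_n^{(2)}$ is simply the composite Boole's rule.

Using the preceding formulas, we can obtain useful estimates of the error.
Using (5.4.39),

$$I - I_n^{(1)} = \frac{16 I_n^{(1)} - I_{n/2}^{(1)}}{15} - I_n^{(1)} + \frac{d_6^{(2)}}{n^6} + \cdots$$

$$= \frac{1}{15}\left[I_n^{(1)} - I_{n/2}^{(1)}\right] + \frac{d_6^{(2)}}{n^6} + \cdots$$

Using $h = (b - a)/n$,

$$I - I_n^{(1)} = \frac{1}{15}\left[I_n^{(1)} - I_{n/2}^{(1)}\right] + O(h^6) \tag{5.4.41}$$

and thus

$$I - I_n^{(1)} \approx \frac{1}{15}\left[I_n^{(1)} - I_{n/2}^{(1)}\right] \tag{5.4.42}$$

since both terms are $O(h^4)$ and the remainder term is $O(h^6)$. This is called
Richardson's error estimate for Simpson's rule.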
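Both claims, that the first Richardson extrapolate of the trapezoidal rule is Simpson's rule and that (5.4.42) estimates its error, can be checked with a short sketch. The test integrand $e^x$ on $[0, 1]$ and all function names are assumptions made for illustration.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    h = (b - a) / n
    inner = sum(f(a + j * h) for j in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

def simpson(f, a, b, n):
    """Composite Simpson's rule (5.4.36) with n (even) subintervals."""
    h = (b - a) / n
    odd = sum(f(a + (2 * j - 1) * h) for j in range(1, n // 2 + 1))
    even = sum(f(a + 2 * j * h) for j in range(1, n // 2))
    return h / 3.0 * (f(a) + 4.0 * odd + 2.0 * even + f(b))

def richardson(f, a, b, n):
    """First Richardson extrapolate (5.4.33) of the trapezoidal rule; n even."""
    return (4.0 * trapezoid(f, a, b, n) - trapezoid(f, a, b, n // 2)) / 3.0

def simpson_error_estimate(f, a, b, n):
    """Richardson's estimate (5.4.42) of I - I_n^(1); n a multiple of 4."""
    return (richardson(f, a, b, n) - richardson(f, a, b, n // 2)) / 15.0

n = 8
extrapolated = richardson(math.exp, 0.0, 1.0, n)
direct = simpson(math.exp, 0.0, 1.0, n)
estimated_error = simpson_error_estimate(math.exp, 0.0, 1.0, n)
actual_error = (math.e - 1.0) - direct
```

The estimate costs nothing extra when $I_{n/2}^{(1)}$ is already available, which is the situation in any program that doubles $n$ until a tolerance is met.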
This extrapolation process can be continued inductively. Define

$$I_n^{(k)} = \frac{4^{k} I_n^{(k-1)} - I_{n/2}^{(k-1)}}{4^{k} - 1} \qquad n \ge 2^{k} \tag{5.4.43}$$

with $n$ a multiple of $2^{k}$, $k \ge 1$. It can be shown that the error has the form

$$I(f) - I_n^{(k)}(f) = \frac{d_{2k+2}^{(k)}}{n^{2k+2}} + \cdots$$

$$= A_k (b - a)\,h^{2k+2} f^{(2k+2)}(\xi) \qquad a \le \xi \le b \tag{5.4.44}$$

with $A_k$ a constant independent of $f$ and $h$, and

$$d_{2k+2}^{(k)} = A_k (b - a)^{2k+2}\left[f^{(2k+1)}(b) - f^{(2k+1)}(a)\right]$$

Finally, it can be shown that for any $f \in C[a, b]$,

$$\lim_{n \to \infty} I_n^{(k)}(f) = I(f) \tag{5.4.45}$$

The rules $I_n^{(k)}(f)$ for $k > 2$ bear no direct relation to the composite
Newton-Cotes rules. See Bauer et al. (1963) for complete details.


Romberg integration   Define

$$J_k(f) = I_{2^{k}}^{(k)}(f) \qquad k = 0, 1, 2, \ldots \tag{5.4.46}$$

This is the Romberg integration rule. Consider the diagram in Figure 5.4 for the
Richardson extrapolates of the trapezoidal rule, with the number of subdivisions
a power of 2. The first column denotes the trapezoidal rule, the second Simpson's
rule, etc. By (5.4.45), each column converges to $I(f)$. Romberg integration is the
rule of taking the diagonal. Since each column converges more rapidly than the
preceding column, assuming $f(x)$ is infinitely differentiable, it could be expected
that $J_k(f)$ would converge more rapidly than $\{I_n^{(k)}\}$ for any $k$. This is usually
the case, and consequently the method has been very popular since the late 1950s.
Compared with Gaussian quadrature, Romberg integration has the advantage of
using evenly spaced abscissas. For a more complete analysis of Romberg
integration, see Bauer et al. (1963).

$$\begin{array}{lllll}
I_1^{(0)} \\
I_2^{(0)} & I_2^{(1)} \\
I_4^{(0)} & I_4^{(1)} & I_4^{(2)} \\
I_8^{(0)} & I_8^{(1)} & I_8^{(2)} & I_8^{(3)} \\
I_{16}^{(0)} & I_{16}^{(1)} & I_{16}^{(2)} & I_{16}^{(3)} & I_{16}^{(4)}
\end{array}$$

Figure 5.4   Romberg integration table.


Example   Using Romberg integration, evaluate

$$I = \int_0^{\pi} e^{x} \cos(x)\,dx = -\tfrac{1}{2}\left(e^{\pi} + 1\right)$$

This was used previously as an example, in Tables 5.1, 5.3, and 5.11, for the
trapezoidal, Simpson, and Gauss-Legendre rules, respectively. The Romberg
results are given in Table 5.17. They show that Romberg integration is superior to
Simpson's rule, but Gaussian quadrature is still more rapidly convergent.

Table 5.17   Example of Romberg integration

  k    Nodes    J_k(f)             Error
  0    2        -34.77851866026    2.27E+1
  1    3        -11.59283955342    -4.78E-1
  2    5        -12.01108431754    -5.93E-2
  3    9        -12.07042041287    7.41E-5
  4    17       -12.07034720873    8.92E-7
  5    33       -12.07034631632    -6.82E-11
  6    65       -12.07034631639    < 5.00E-12

To compute $J_k(f)$ for a particular $k$, having already computed $J_1(f)$,
$\ldots$, $J_{k-1}(f)$, the row

$$I_{2^{k-1}}^{(0)},\ I_{2^{k-1}}^{(1)},\ \ldots,\ I_{2^{k-1}}^{(k-1)} \tag{5.4.47}$$

should have been saved in temporary storage. Then compute $I_{2^{k}}^{(0)}$ from
$I_{2^{k-1}}^{(0)}$ and $2^{k-1}$ new function values. Using (5.4.33), (5.4.40), and (5.4.43),
compute the next row in the table, including $J_k(f)$. Compare $J_k(f)$ and $J_{k-1}(f)$
to see if there is sufficient accuracy to accept $J_k(f)$ as an accurate
approximation to $I(f)$.

We give this procedure in a formal way in the following algorithm. It is
included for pedagogical reasons, and it should not be considered as a serious
program unless some improvements are included. For example, the error test is
primitive and much too conservative, and a safety check needs to be included for
the numerical integrals associated with small k, when not enough function values
have yet been sampled.


Algorithm   Romberg (f, a, b, ε, int)

1.   Remark: Use Romberg integration to calculate int, an estimate
     of the integral

$$I = \int_a^b f(x)\,dx$$

     Stop when $|I - \text{int}| \le \epsilon$.

2.   Initialize:
     $k := 0$, $n := 1$,
     $T_0 := R_0 := a_0 := (b - a)[f(a) + f(b)]/2$

3.   Begin the main loop:
     $n := 2n$   $k := k + 1$   $h := (b - a)/n$

4.   $\text{sum} := \sum_{j=1}^{n/2} f(a + (2j - 1)h)$

5.   $T_k := h \cdot \text{sum} + \frac{1}{2} T_{k-1}$

6.   $\beta_j := a_j$,   $j = 0, 1, \ldots, k - 1$

7.   $a_0 := T_k$   $m := 1$

8.   Do through step 10 for $j = 1, 2, \ldots, k$

9.   $m := 4m$

10.  $a_j := \dfrac{m \cdot a_{j-1} - \beta_{j-1}}{m - 1}$

11.  $R_k := a_k$

12.  If $|R_k - R_{k-1}| > \epsilon$, then go to step 3
13.  Since $|R_k - R_{k-1}| \le \epsilon$, accept int $= R_{k-1}$ and return.
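The algorithm translates almost line for line into code. The following is a minimal sketch, not a production routine: it keeps the primitive stopping test of step 12, returns the most recent diagonal entry $R_k$ together with $k$ (a choice made here for illustration), and adds a hypothetical `max_k` guard so that the loop always terminates.

```python
import math

def romberg(f, a, b, eps, max_k=20):
    """Romberg integration following steps 2-12 of the algorithm above.

    Returns (R_k, k) for the first k with |R_k - R_(k-1)| <= eps.
    The max_k limit is an added safeguard, not part of the algorithm.
    """
    n = 1
    t = (b - a) * (f(a) + f(b)) / 2.0      # T_0, trapezoidal rule with one panel
    row = [t]                              # a_0, ..., a_k for the current k
    r_prev = t
    for k in range(1, max_k + 1):
        n *= 2
        h = (b - a) / n
        # Only the new (odd-indexed) nodes are evaluated.
        s = sum(f(a + (2 * j - 1) * h) for j in range(1, n // 2 + 1))
        t = h * s + 0.5 * t                # T_k from T_(k-1)
        beta = row                         # previous row of the table
        row = [t]
        m = 1
        for j in range(1, k + 1):
            m *= 4
            row.append((m * row[j - 1] - beta[j - 1]) / (m - 1))
        r = row[k]                         # diagonal entry J_k
        if abs(r - r_prev) <= eps:
            return r, k
        r_prev = r
    raise RuntimeError("no convergence within max_k refinements")

def integrand(x):
    return math.exp(x) * math.cos(x)

value, steps = romberg(integrand, 0.0, math.pi, 1e-8)
exact = -(math.exp(math.pi) + 1.0) / 2.0
```

Only one row of the table is kept at a time, which matches the storage described around (5.4.47).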


There are many variants of Romberg integration. For example, other ways of 
increasing the number of nodes have been studied. For a very complete survey of 
the literature on Romberg integration, see Davis and Rabinowitz (1984, pp. 
434-446). They also give a Fortran program for Romberg integration. 


5.5 Automatic Numerical Integration


An automatic numerical integration program calculates an approximate integral 
to within an accuracy specified by the user of the program. The user does not 
need to specify either the method or the number of nodes to be used. There are 
some excellent automatic integration programs, and many people use them. Such 
a program saves you the time of writing your own program, and for many people, 
it avoids having to understand the needed numerical integration theory. Nonethe- 
less, it is almost always possible to improve upon an automatic program, 
although it usually requires a good knowledge of the numerical integration 
needed for your particular problem. When doing only a small number of
numerical integrations, automatic integration is often a good way to save time.
But for problems involving many integrations, it is probably better to invest the 
time to find a less expensive numerical integration procedure. 

An automatic numerical integration program functions as a “black box,” 
without the user being able to see the intermediate steps of the computation. 
Because of this, the most important characteristic of such a program is that it be 
reliable: The approximate integral that is returned by the program and that the 
program says satisfies the user’s error tolerance must, in fact, be that accurate. In 
theory, no such algorithm exists, as we explain in the next paragraph. But for the 
type of integrands that one usually considers in practice, there are programs that 
have a high order of reliability. This reliability will be improved if the user reads 
the program description, to see the restrictions and assumptions of the program. 

To understand the theoretical impossibility of a perfectly reliable automatic
integration program, note that the program will evaluate the integrand $f(x)$ at
only a finite number of points, say $x_1, \ldots, x_n$. Then there are an infinity of
continuous functions $\tilde{f}(x)$ for which

$$\tilde{f}(x_i) = f(x_i) \qquad i = 1, \ldots, n$$

and

$$\int_a^b \tilde{f}(x)\,dx \ne \int_a^b f(x)\,dx$$

In fact, there are an infinity of such functions $\tilde{f}(x)$ that are infinitely
differentiable. For practical problems, it is unlikely that a well-constructed
automatic integration program will be unreliable, but it is possible. An automatic
integration program can be made more reliable by increasing the stringency of its
error tests, but this also makes the program less efficient. Generally there is a
tradeoff between reliability and efficiency. For a further discussion of the
questions of reliability and efficiency of automatic quadrature programs, see
Lyness and Kaganove (1976).
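A concrete instance of this argument is easy to construct. In the sketch below, which is an illustration with hypothetical names, $f(x) = x^2$ and $\tilde{f}(x) = x^2 + \sin^2(4\pi x)$ agree (up to rounding) at the five nodes $0, .25, .5, .75, 1$, so Simpson's rule with 2 and with 4 subdivisions cannot tell them apart, although their integrals over $[0, 1]$ are $1/3$ and $5/6$.

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    odd = sum(f(a + (2 * j - 1) * h) for j in range(1, n // 2 + 1))
    even = sum(f(a + 2 * j * h) for j in range(1, n // 2))
    return h / 3.0 * (f(a) + 4.0 * odd + 2.0 * even + f(b))

def f(x):
    return x * x

def f_tilde(x):
    # The added term vanishes (to rounding) at every multiple of 1/4.
    return x * x + math.sin(4.0 * math.pi * x) ** 2

coarse = simpson(f_tilde, 0.0, 1.0, 2)
fine = simpson(f_tilde, 0.0, 1.0, 4)
apparent_error = abs(fine - coarse)      # what a program comparing the two would see
true_error = abs(5.0 / 6.0 - fine)       # the actual error for f_tilde
```

A program that accepts `fine` because it agrees with `coarse` would report an answer that is wrong by $1/2$; sampling at more points moves the difficulty to a different $\tilde{f}$ but does not remove it.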


Adaptive quadrature Automatic programs can be divided into (1) those using a 
global rule, such as Gaussian quadrature or the trapezoidal rule with even 
spacing, and (2) those using an adaptive strategy, in which the integration rule 
varies its placement of node points and even its definition to reflect the varying 
local behavior of the integrand. Global strategies use the type of error estimation 
that we have discussed in previous sections. We now discuss the concept and 
practice of an adaptive strategy. 

Many integrands vary in their smoothness or differentiability at different
points of the interval of integration $[a, b]$. For example, with

$$I = \int_0^1 \sqrt{x}\,dx$$

the integrand has infinite slope at $x = 0$, but the function is well behaved at
points $x$ near 1. Most numerical methods use a uniform grid of node points, that
is, the density of node points is about equal throughout the integration interval.
This includes composite Newton-Cotes formulas, Gaussian quadrature, and
Romberg integration. When the integrand is badly behaved at some point $\alpha$ in
the interval $[a, b]$, many node points must be placed near $\alpha$ to compensate for
this. But this forces many more node points than necessary to be used at all other
parts of $[a, b]$. Adaptive integration attempts to place node points according to
the behavior of the integrand, with the density of node points being greater near
points of bad behavior.

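We now explain the basic concept of adaptive integration using a simplified
adaptive Simpson's rule. To see more precisely why variable spacing is necessary,
consider Simpson's rule with such a spacing of the nodes:

$$I(f) = \sum_{j=1}^{n/2} \int_{x_{2j-2}}^{x_{2j}} f(x)\,dx \approx I_n(f) = \sum_{j=1}^{n/2} \left(\frac{x_{2j} - x_{2j-2}}{6}\right)\left[f_{2j-2} + 4 f_{2j-1} + f_{2j}\right] \tag{5.5.1}$$

with $x_{2j-1} = (x_{2j-2} + x_{2j})/2$. Using (5.1.15),

$$I(f) - I_n(f) = -\frac{1}{2880} \sum_{j=1}^{n/2} \left(x_{2j} - x_{2j-2}\right)^5 f^{(4)}(\xi_j) \tag{5.5.2}$$

with $x_{2j-2} \le \xi_j \le x_{2j}$. Clearly, you want to choose $x_{2j} - x_{2j-2}$ according to the
size of $f^{(4)}(\xi_j)$, which is unknown in general. If $f^{(4)}(x)$ varies greatly in
magnitude, you do not want even spacing of the node points.

As notation, introduce

$$I_{\alpha,\beta} = \int_{\alpha}^{\beta} f(x)\,dx$$

$$I_{\alpha,\beta}^{(1)} = \frac{h}{3}\left[f(\alpha) + 4 f\!\left(\frac{\alpha + \beta}{2}\right) + f(\beta)\right] \qquad h = \frac{\beta - \alpha}{2} \tag{5.5.3}$$

$$I_{\alpha,\beta}^{(2)} = I_{\alpha,\gamma}^{(1)} + I_{\gamma,\beta}^{(1)} \qquad \gamma = \frac{\alpha + \beta}{2}$$

To describe the adaptive algorithm for computing

$$I = \int_a^b f(x)\,dx$$

we use a recursive definition. Suppose that $\epsilon > 0$ is given, and that we want to
find an approximate integral $\tilde{I}$ for which

$$|I - \tilde{I}| \le \epsilon \tag{5.5.4}$$

Begin by setting $\alpha = a$, $\beta = b$. Compute $I_{\alpha,\beta}^{(1)}$ and $I_{\alpha,\beta}^{(2)}$. If

$$\left|I_{\alpha,\beta}^{(2)} - I_{\alpha,\beta}^{(1)}\right| \le \epsilon \tag{5.5.5}$$

then accept $I_{\alpha,\beta}^{(2)}$ as the adaptive integral approximation to $I_{\alpha,\beta}$. Otherwise let
$\epsilon := \epsilon/2$, and set the adaptive integral for $I_{\alpha,\beta}$ equal to the sum of the adaptive
integrals for $I_{\alpha,\gamma}$ and $I_{\gamma,\beta}$, $\gamma = (\alpha + \beta)/2$, each to be computed with an error
tolerance of $\epsilon$.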
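The recursive definition can be written down directly. The following is a deliberately simple sketch, with none of the safeguards mentioned in the text: it re-evaluates the integrand at shared points instead of stacking values, and it adds a hypothetical `max_depth` limit so that the recursion always stops. The optional `pieces` list records each accepted subinterval.

```python
import math

def simpson_panel(f, alpha, beta):
    """Simpson's rule I^(1) on the single interval [alpha, beta], as in (5.5.3)."""
    h = (beta - alpha) / 2.0
    return h / 3.0 * (f(alpha) + 4.0 * f((alpha + beta) / 2.0) + f(beta))

def adaptive_simpson(f, alpha, beta, eps, pieces=None, depth=0, max_depth=40):
    """Adaptive integral over [alpha, beta] using the test (5.5.5).

    Each accepted subinterval is appended to pieces as
    (alpha, beta, I2, |I2 - I1|, eps). max_depth is an added safeguard.
    """
    gamma = (alpha + beta) / 2.0
    i1 = simpson_panel(f, alpha, beta)
    i2 = simpson_panel(f, alpha, gamma) + simpson_panel(f, gamma, beta)
    if abs(i2 - i1) <= eps or depth >= max_depth:
        if pieces is not None:
            pieces.append((alpha, beta, i2, abs(i2 - i1), eps))
        return i2
    # Otherwise halve the tolerance and treat each half separately.
    left = adaptive_simpson(f, alpha, gamma, eps / 2.0, pieces, depth + 1, max_depth)
    right = adaptive_simpson(f, gamma, beta, eps / 2.0, pieces, depth + 1, max_depth)
    return left + right

pieces = []
total = adaptive_simpson(math.sqrt, 0.0, 1.0, 0.005, pieces)
calculated_bound = sum(p[3] for p in pieces)   # sum of the accepted |I2 - I1|
```

For $\sqrt{x}$ the accepted subintervals cluster near $x = 0$, where the integrand has infinite slope, which is the node placement the adaptive idea is meant to produce.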

In an actual implementation as a computer program, many extra limitations 
are included as safeguards; and the error estimation is usually much more 
sophisticated. All function evaluations are handled carefully in order to ensure 
that the integrand is never evaluated twice at the same point. This requires a 
clever stacking procedure for those values of f(x) that must be temporarily 
stored because they will be needed again later in the computation. There are 
many small modifications that can be made to improve the performance of the 
program, but generally a great deal of experience and empirical investigation is 
first necessary. For that and other reasons, it is recommended that standard 
well-tested adaptive procedures be used [e.g., de Boor (1971), Piessens et al. 
(1983)]. This is discussed further at the end of the section. 


Table 5.18   Adaptive Simpson's example (5.5.6)

  [α, β]            I^(2)      I - I^(2)    I - I^(1)    |I^(2) - I^(1)|    ε
  [0.0, .0625]      .010258    1.6E-4       4.5E-4       2.9E-4             .0003125
  [.0625, .125]     .019046    1.2E-7       1.1E-6       1.0E-6             .0003125
  [.125, .25]       .053871    4.5E-7       3.6E-6       4.0E-6             .000625
  [.25, .5]         .152368    9.3E-7       1.1E-5       1.0E-5             .00125
  [.5, 1.0]         .430962    2.4E-6       3.0E-5       2.8E-5             .0025

Example   Consider using the preceding simpleminded adaptive Simpson
procedure to evaluate

$$I = \int_0^1 \sqrt{x}\,dx$$

with $\epsilon = .005$ on $[0, 1]$. The final intervals $[\alpha, \beta]$ and integrals $I_{\alpha,\beta}^{(2)}$ are given in
Table 5.18. The column labeled $\epsilon$ gives the error tolerance used in the test (5.5.5),
which estimates the error in $I_{\alpha,\beta}^{(1)}$. The error estimated for $I_{\alpha,\beta}^{(1)}$ on $[0, .0625]$ was
inaccurate, but it was accurate for the remaining subintervals. The value used to
estimate $I_{\alpha,\beta}$ is actually $I_{\alpha,\beta}^{(2)}$, and it is sufficiently accurate on all subintervals.
The total integral, obtained by summing all $I_{\alpha,\beta}^{(2)}$, is

$$\tilde{I} = .666505 \qquad I - \tilde{I} = 1.6\text{E-}4$$

and the calculated bound is

$$|I - \tilde{I}| \le 3.3\text{E-}4 \tag{5.5.6}$$

obtained by summing the column labeled $|I^{(2)} - I^{(1)}|$. Note that the error is
concentrated on the first subinterval, as could have been predicted from the
behavior of the integrand near $x = 0$. For an example where the test (5.5.5) is not
adequate, see Problem 32.


Some automatic integration programs One of the better known automatic 
integration programs is the adaptive program CADRE (Cautious Adaptive 
Romberg Extrapolation), given in de Boor (1971). It includes a means of 
recognizing algebraic singularities at the endpoints of the integration interval. 
The asymptotic error formulas of Lyness and Ninham (1967), given in (5.4.21) in 
a special case, are used to produce a more rapidly convergent integration method, 
again based on repeated Richardson extrapolation. The routine CADRE has 
been found empirically to be both quite reliable and efficient. 


Table 5.19   Integration examples for CADRE

                                      Desired Error
              10^-2                  10^-5                  10^-8
  Integral    Error           N      Error           N      Error            N
  I_1         A = 2.49E-6     76     A = 1.40E-7     733    A = 4.60E-11     225
              P = 5.30E-4            P = 4.45E-6            P = 2.48E-9
  I_2         A = 1.18E-5     9      A = 3.96E-7     17     A = 2.73E-10     129
              P = 3.27E-3            P = 3.56E-6            P = 2.81E-9
  I_3         A = 1.03E-4     17     A = 3.23E-8     33     A = 1.98E-9      65
              P = 2.98E-3            P = 4.43E-8            P = 2.86E-9
  I_4         A = 6.57E-5     105    A = 6.45E-8     209    A = 4.80E-9      281
              P = 4.98E-3            P = 9.22E-6            P = 1.55E-8
  I_5         A = 2.71E-5     226    A = 7.41E-8     418    A = 5.89E-9      562
              P = 3.02E-3            P = 1.00E-5            P = 1.11E-8
  I_6         A = 8.49E-6     955    A = 2.37E-8     1171   A = 4.30E-11     2577
              P = 8.48E-3            P = 1.67E-5            P = 2.07E-8
  I_7         A = 4.54E-4     98     A = 7.72E-7     418    A = **           1506
              P = 1.30E-3            P = 8.02E-6            P = **

A more recently developed package is QUADPACK, some of whose programs
are general purpose, while others deal with special classes of integrals. The
package was a collaborative effort, and a complete description of it is given in
Piessens et al. (1983). The package is well tested and appears to be an excellent
collection of programs.

We illustrate the preceding by calculating numerical approximations to the
following integrals:

$$I_1 = \int_0^1 \frac{4\,dx}{1 + 256(x - .375)^2} = \frac{1}{4}\left[\tan^{-1}(10) + \tan^{-1}(6)\right]$$

$$I_5 = \int_0^1 \log|x - .7|\,dx = .3\log(.3) + .7\log(.7) - 1$$

$$I_6 = \int_0^{2\pi} e^{-x} \sin(50x)\,dx = \frac{50}{2501}\left(1 - e^{-2\pi}\right)$$

From QUADPACK, we chose DQAGP. It too contains ways to recognize
algebraic singularities at the endpoints and to compensate for their presence. To
improve performance, it allows the user to specify points interior to the
integration interval at which the integrand is singular.

We used both CADRE and DQAGP to calculate the preceding integrals, with
error tolerances of $10^{-2}$, $10^{-5}$, and $10^{-8}$. The results are shown in Tables 5.19
and 5.20. To more fairly compare DQAGP and CADRE, we applied CADRE to
two integrals in both $I_5$ and $I_7$, to have the singularities occur at endpoints. For
example, we used CADRE for each of the integrals in

$$I_7 = \int_{-9}^{0} \frac{dx}{\sqrt{-x}} + \int_0^{10000} \frac{dx}{\sqrt{x}} \tag{5.5.7}$$

In the tables, P denotes the error bound predicted by the program and A denotes
the actual absolute error in the calculated answer. Column N gives the number of
integrand evaluations. At all points at which the integrand was undefined, it was
arbitrarily set to zero. The examples were computed in double precision on a
Prime 850, with a unit round of $2^{-46} = 1.4 \times 10^{-14}$.

In Table 5.19, CADRE failed for $I_7$ with the tolerance $10^{-8}$, even though
(5.5.7) was used. Otherwise, it performed quite well. When the decomposition
(5.5.7) is not used and CADRE is called only once for the single interval
$[-9, 10000]$, it fails for all three error tolerances.

Table 5.20   Integration examples for DQAGP

                                      Desired Error
              10^-2                  10^-5                  10^-8
  Integral    Error           N      Error           N      Error            N
  I_1         A = 2.88E-9     105    A = 5.40E-13    147    A = 5.40E-13     147
              P = 2.96E-3            P = 5.21E-10           P = 5.21E-10
  I_2         A = 1.17E-11    21     A = 1.17E-11    21     A = 1.17E-11     21
              P = 7.46E-9            P = 7.46E-9            P = 7.46E-9
  I_3         A = 4.79E-6     21     A = 4.62E-13    189    A = 4.62E-13     189
              P = 4.95E-3            P = 4.77E-14           P = 4.77E-14
  I_4         A = 5.97E-13    231    A = 5.97E-13    231    A = 5.97E-13     231
              P = 7.15E-14           P = 7.15E-14           P = 7.15E-14
  I_5         A = 8.67E-13    462    A = 8.67E-13    462    A = 8.67E-13     462
              P = 1.15E-13           P = 1.15E-13           P = 1.15E-13
  I_6         A = 1.00E-3     525    A = 6.33E-14    861    A = 5.33E-14     1239
              P = 4.36E-3            P = 8.13E-6            P = 7.12E-9
  I_7         A = 1.67E-10    462    A = 1.67E-10    462    A = 1.67E-10     462
              P = 1.16E-10           P = 1.16E-10           P = 1.16E-10

In Table 5.20, the predicted error is in some cases smaller than the actual
error. This difficulty appears to be due to working at the limits of the machine
arithmetic precision, and in all cases the final error was well within the limits
requested.

In comparing the two programs, DQAGP and CADRE are both quite reliable
and efficient. Also, both programs perform relatively poorly for the highly
oscillatory integral $I_6$, showing that $I_6$ should be evaluated using a program
designed for oscillatory integrals (such as DQAWO in QUADPACK, for Fourier
coefficient calculations). From the tables, DQAGP is somewhat more able to deal
with difficult integrals, while remaining about equally efficient compared to
CADRE. Much more detailed examples for CADRE are given in Robinson
(1979).

Automatic quadrature programs can be easily misused in large calculations,
resulting in erroneous results and great inefficiency. For comments on the use of 
such programs in large calculations and suggestions for choosing when to use 
them, see Lyness (1983). The following are from his concluding remarks. 


The Automatic Quadrature Rule (AQR) is an impressive and practical item
of numerical software. Its main advantage for the user is that it is
convenient. He can take it from the library shelf, plug it in, and feel
confident that it will work. For this convenience, there is in general a
modest charge in CPU time, this surcharge being a factor of about 3. The
Rule Evaluation Quadrature Routine (REQR) [non-automatic quadrature rule]
does not carry this surcharge, but to code and check out an REQR might take
a couple of hours of the user’s time. So unless the expected CPU time is
high, many users willingly pay the surcharge in order to save themselves
time and trouble.

However there are certain—usually large scale—problems for which the 
AQR is not designed and in which its uncritical use can lead to CPU time 
surcharges by factors of 100 or more. ... These are characterized by the 
circumstances that a large number of separate quadratures are involved, 
and that the results of these quadratures are subsequently used as input to 
some other numerical process. In order to recognize this situation, it is 
necessary to examine the subsequent numerical process to see whether it 
requires a smooth input function.... For some of these problems, an 
REQR is quite suitable while an AQR may lead to a numerical disaster. 


5.6 Singular Integrals 


We discuss the approximate evaluation of integrals for which methods of the type 
discussed in Sections 5.1 through 5.4 do not perform well: these methods include 
the composite Newton-Cotes rules (e.g., the trapezoidal rule), Gauss-Legendre
quadrature, and Romberg integration. The integrals discussed here lead to poorly 
convergent numerical integrals when evaluated using the latter integration rules, 
for a variety of reasons. We discuss (1) integrals whose integrands contain a 
singularity in the interval of integration (a, b), and (2) integrals with an infinite 
interval of integration. Adaptive integration methods can be used for these 
integrals, but it is usually possible to obtain more rapidly convergent approxima- 
tions by carefully examining the nature of the singular behavior and then 
compensating for it. 


Change of the variable of integration    We illustrate the importance of this idea
with several examples. For

$$ I = \int_0^b \frac{f(x)}{\sqrt{x}}\,dx \qquad (5.6.1) $$

with f(x) a function with several continuous derivatives, let $x = u^2$,
$0 \le u \le \sqrt{b}$. Then

$$ I = 2\int_0^{\sqrt{b}} f(u^2)\,du $$

This integral has a smooth integrand and standard techniques can be applied
to it.

Similarly,

$$ \int_0^1 \sin(x)\sqrt{1 - x^2}\,dx = 2\int_0^1 u^2\sqrt{2 - u^2}\,\sin(1 - u^2)\,du $$

using the change of variable $u = \sqrt{1 - x}$. The right-hand integrand has an
infinite number of continuous derivatives on [0, 1], whereas the derivative of the
first integrand was singular at x = 1.

For an infinite interval of integration, the change of variable technique is also
useful. Suppose

$$ I = \int_1^{\infty} \frac{f(x)}{x^p}\,dx \qquad p > 1 \qquad (5.6.2) $$

with $\lim_{x \to \infty} f(x)$ existing. Also assume f(x) is smooth on $[1, \infty)$. Then use
the change of variable

$$ x = \frac{1}{u^{\alpha}} \qquad dx = \frac{-\alpha}{u^{1+\alpha}}\,du \qquad \text{for some } \alpha > 0 $$

Then

$$ I = \alpha\int_0^1 u^{p\alpha} f\!\left(\frac{1}{u^{\alpha}}\right)\frac{du}{u^{1+\alpha}} = \alpha\int_0^1 u^{(p-1)\alpha - 1} f\!\left(\frac{1}{u^{\alpha}}\right) du \qquad (5.6.3) $$

Maximize the smoothness of the new integrand at u = 0 by picking $\alpha$ to produce
a large value for the exponent $(p - 1)\alpha - 1$. For example, with

$$ I = \int_1^{\infty} \frac{f(x)}{x\sqrt{x}}\,dx $$

the change of variable $x = 1/u^4$ leads to

$$ I = 4\int_0^1 u\, f\!\left(\frac{1}{u^4}\right) du \qquad (5.6.4) $$

If we assume a behavior at $x = \infty$ of

$$ f(x) = c_0 + \frac{c_1}{x} + \frac{c_2}{x^2} + \cdots $$

then

$$ f\!\left(\frac{1}{u^4}\right) = c_0 + c_1 u^4 + c_2 u^8 + \cdots $$

and (5.6.4) has a smooth integrand at u = 0.
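
The effect of these substitutions can be checked numerically. The following is a minimal sketch, not taken from the text: it applies a composite Simpson rule (the helper name `simpson` is an assumption of this sketch) to the integrand $\sin(x)\sqrt{1 - x^2}$ before and after the substitution $u = \sqrt{1 - x}$, and then to (5.6.4) with the illustrative choice f(x) = x/(1 + x), for which the exact value of the integral is $\pi/2$.

```python
import math

def simpson(g, a, b, n):
    """Composite Simpson rule with n (even) subintervals on [a, b]."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

# Finite interval: the original integrand has a singular derivative at x = 1.
def original(x):
    return math.sin(x) * math.sqrt(max(0.0, 1.0 - x * x))

# Same integral after the change of variable u = sqrt(1 - x).
def transformed(u):
    return 2.0 * u * u * math.sqrt(2.0 - u * u) * math.sin(1.0 - u * u)

# Reference value from a fine grid on the smooth, transformed integrand.
reference = simpson(transformed, 0.0, 1.0, 2048)
for n in (8, 16, 32, 64):
    e_old = abs(simpson(original, 0.0, 1.0, n) - reference)
    e_new = abs(simpson(transformed, 0.0, 1.0, n) - reference)
    print(f"n={n:3d}  original {e_old:.2e}  transformed {e_new:.2e}")

# Infinite interval: f(x) = x/(1+x) in (5.6.4) gives f(1/u^4) = 1/(1+u^4).
def infinite_transformed(u):
    return 4.0 * u / (1.0 + u ** 4)

approx = simpson(infinite_transformed, 0.0, 1.0, 64)
print("infinite interval:", approx, "exact:", math.pi / 2)
```

With the same number of nodes, the transformed integrand should give a much smaller error, since Simpson's rule attains its full order only for smooth integrands.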
An interesting idea has been given in Iri et al. (1970) to deal with endpoint
singularities in the integral

$$ I = \int_a^b f(x)\,dx \qquad (5.6.5) $$

Define

$$ \psi(t) = \exp\!\left(\frac{-c}{1 - t^2}\right) \qquad (5.6.6) $$

$$ \varphi(t) = a + \frac{b - a}{\gamma}\int_{-1}^{t} \psi(u)\,du \qquad -1 \le t \le 1 \qquad (5.6.7) $$

where c is a positive constant and

$$ \gamma = \int_{-1}^{1} \psi(u)\,du $$

As t varies from -1 to 1, $\varphi(t)$ varies from a to b. Using $x = \varphi(t)$ as a change
of variable in (5.6.5), we obtain

$$ I = \int_{-1}^{1} f(\varphi(t))\,\varphi'(t)\,dt \qquad (5.6.8) $$

The function $\varphi'(t) = ((b - a)/\gamma)\,\psi(t)$ is infinitely differentiable on [-1, 1], and
it and all of its derivatives are zero at $t = \pm 1$. In (5.6.8), the integrand and all of
its derivatives will vanish at $t = \pm 1$ for virtually all functions f(x) of interest.
Using the error formula (5.4.9) for the trapezoidal rule on [-1, 1], it can be seen
that the trapezoidal rule will converge very rapidly when applied to (5.6.8). We
will call this method the IMT method.

This method has been implemented in de Doncker and Piessens (1976), and in
the general comparisons of Robinson (1979), it is rated as an extremely reliable
and quite efficient way of handling integrals (5.6.5) that have endpoint
singularities. De Doncker and Piessens (1976) also treat integrals over $[0, \infty)$ by
first using the change of variable $x = (1 + u)/(1 - u)$, $-1 \le u < 1$, followed by the
change of variable $u = \varphi(t)$.


Example    Use the preceding method (5.6.5)-(5.6.8) with the trapezoidal rule, to
evaluate

$$ I = \int_0^1 \sqrt{-\ln x}\,dx = \frac{\sqrt{\pi}}{2} \doteq .8862269 \qquad (5.6.9) $$

Note that the integrand has singular behavior at both endpoints, although it is
different in the two cases. The constant in (5.6.6) is c = 4, and the evaluation of
(5.6.7) is taken from Robinson and de Doncker (1981). The results are shown in
Table 5.21. The column labeled nodes gives the number of nodes interior to [0, 1].


Table 5.21    Example of the IMT method

Nodes    Error
  2      -6.54E-2
  4       5.82E-3
  8      -1.30E-4
 16       7.42E-6
 32       1.17E-8
 64       1.18E-12
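
The IMT transformation can be sketched in a few lines. The code below is only an illustration under stated assumptions: it evaluates $\varphi(t)$ in (5.6.7) by a running composite Simpson rule, which is not the evaluation scheme of Robinson and de Doncker (1981), and the function names `psi` and `imt_trapezoid` and the parameter `fine` are hypothetical. Its errors therefore need not reproduce Table 5.21 digit for digit.

```python
import math

def psi(t, c):
    """psi(t) = exp(-c / (1 - t^2)) as in (5.6.6), taken as 0 at t = +-1."""
    d = 1.0 - t * t
    return math.exp(-c / d) if d > 0.0 else 0.0

def imt_trapezoid(f, a, b, n, c=4.0, fine=2048):
    """Trapezoidal rule with n subintervals of [-1, 1] applied to (5.6.8)."""
    h = 2.0 / n
    m = max(2, 2 * (fine // (2 * n)))   # even number of Simpson panels per subinterval
    # Running integral of psi from -1 up to each node t_j.
    cumulative = [0.0]
    for j in range(1, n + 1):
        lo = -1.0 + (j - 1) * h
        k = h / m
        s = psi(lo, c) + psi(lo + h, c)
        for i in range(1, m):
            s += (4 if i % 2 else 2) * psi(lo + i * k, c)
        cumulative.append(cumulative[-1] + s * k / 3.0)
    gamma = cumulative[-1]
    total = 0.0
    for j in range(1, n):               # the terms at t = +-1 vanish
        t = -1.0 + j * h
        x = a + (b - a) * cumulative[j] / gamma      # x = phi(t)
        dphi = (b - a) * psi(t, c) / gamma           # phi'(t)
        if dphi == 0.0 or not (a < x < b):
            continue    # negligible weight; avoid evaluating f at a singular endpoint
        total += f(x) * dphi
    return h * total

exact = math.sqrt(math.pi) / 2
for n in (4, 8, 16, 32, 64):
    approx = imt_trapezoid(lambda x: math.sqrt(-math.log(x)), 0.0, 1.0, n)
    print(f"n={n:3d}  error {exact - approx: .2e}")
```

Nodes whose weight underflows to zero are skipped, so f is never evaluated exactly at an endpoint where it may be singular.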
Gaussian quadrature    In Section 5.3, we developed a general theory for
Gaussian quadrature formulas

$$ \int_a^b w(x) f(x)\,dx \approx \sum_{j=1}^{n} w_{j,n} f(x_{j,n}) \qquad n \ge 1 $$

that have degree of precision 2n - 1. The construction of the nodes and weights,
and the form of the error, are given in Theorem 5.3. For our work in this section
we note that (1) the interval (a, b) is allowed to be infinite, and (2) w(x) can
have singularities on (a, b), provided it is nonnegative and satisfies the
assumptions (4.3.8) and (4.3.9) of Section 4.3. For rapid convergence, we would also
expect that f(x) would need to be a smooth function, as was illustrated with
Gauss-Legendre quadrature in Section 5.3.

The weights and nodes for a wide variety of weight functions w(x) and
intervals (a, b) are known. The tables of Stroud and Secrest (1966) include the
integrals

$$ \int_0^{\infty} x^{\alpha} e^{-x} f(x)\,dx \qquad \int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx \qquad \int_0^1 \ln\!\left(\frac{1}{x}\right) f(x)\,dx \qquad (5.6.10) $$

and others. The constant $\alpha > -1$. There are additional books containing tables
for integrals other than those in (5.6.10). In addition, the paper by Golub and
Welsch (1969) describes a procedure for constructing the nodes and weights in
(5.6.10), based on solving a matrix eigenvalue problem. A program is given, and
it includes most of the more popular weighted integrals to which Gaussian
quadrature is applied. For an additional discussion of Gaussian quadrature, with
references to the literature (including tables and programs), see Davis and
Rabinowitz (1984, pp. 95-132, 222-229).


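Example    We illustrate the use of Gaussian quadrature for evaluating integrals

$$ I = \int_0^{\infty} g(x)\,dx $$

We use Gauss-Laguerre quadrature, in which $w(x) = e^{-x}$. Then write I as

$$ I = \int_0^{\infty} e^{-x}\left[e^{x} g(x)\right] dx = \int_0^{\infty} e^{-x} f(x)\,dx \qquad (5.6.11) $$

We give results for three integrals:

$$ I^{(1)} = \int_0^{\infty} \frac{x\,dx}{e^{x} - 1} = \frac{\pi^2}{6} \qquad I^{(2)} = \int_0^{\infty} \frac{x\,dx}{(1 + x^2)^5} = \frac{1}{8} \qquad I^{(3)} = \int_0^{\infty} \frac{dx}{1 + x^2} = \frac{\pi}{2} $$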
Table 5.22    Examples of Gauss-Laguerre quadrature

Nodes    I(1) - I_n(1)    I(2) - I_n(2)    I(3) - I_n(3)
  2       1.01E-4         -8.05E-2          7.75E-2
  4       1.28E-5         -4.20E-2          6.96E-2
  8      -9.48E-8          1.27E-2          3.70E-2
 16       3.16E-11        -1.39E-3          1.71E-2
 32       7.41E-14         3.05E-5          8.31E-3
 64       -                1.06E-7          4.07E-3


Gauss-Laguerre quadrature is best for integrands that decrease exponentially as
$x \to \infty$. For integrands that are $O(1/x^p)$, p > 1, as $x \to \infty$, the convergence
rate becomes quite poor as $p \to 1$. These comments are illustrated in Table 5.22.
For a formal discussion of the convergence of Gauss—Laguerre quadrature, see 
Davis and Rabinowitz (1984, p. 227). 
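
As a small check on the rewriting (5.6.11), the two-point Gauss-Laguerre rule can be applied by hand. The sketch below is an illustration only: it hard-codes the two-point nodes $2 \mp \sqrt{2}$ and weights $(2 \pm \sqrt{2})/4$ rather than using a general node-generation procedure such as that of Golub and Welsch (1969), and the name `gauss_laguerre_2` is hypothetical. The three test integrands are $x/(e^x - 1)$, $x/(1 + x^2)^5$, and $1/(1 + x^2)$, with exact values $\pi^2/6$, $1/8$, and $\pi/2$.

```python
import math

# Two-point Gauss-Laguerre rule: the nodes are the zeros of L_2(x) = (x^2 - 4x + 2)/2.
NODES = (2.0 - math.sqrt(2.0), 2.0 + math.sqrt(2.0))
WEIGHTS = ((2.0 + math.sqrt(2.0)) / 4.0, (2.0 - math.sqrt(2.0)) / 4.0)

def gauss_laguerre_2(g):
    """Approximate the integral of g over [0, inf) via g = e^{-x} [e^{x} g], as in (5.6.11)."""
    return sum(w * math.exp(x) * g(x) for x, w in zip(NODES, WEIGHTS))

cases = [
    ("I1", lambda x: x / math.expm1(x), math.pi ** 2 / 6),   # decays exponentially
    ("I2", lambda x: x / (1 + x * x) ** 5, 1 / 8),            # decays like 1/x^9
    ("I3", lambda x: 1 / (1 + x * x), math.pi / 2),           # decays like 1/x^2
]
errors = {name: exact - gauss_laguerre_2(g) for name, g, exact in cases}
for name, err in errors.items():
    print(f"{name}: exact - approximation = {err: .2e}")
```

Even with two nodes, the exponentially decaying integrand is handled far better than the two algebraically decaying ones.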

One especially easy case of Gaussian quadrature is for the singular integral

$$ I(f) = \int_{-1}^{1} \frac{f(x)\,dx}{\sqrt{1 - x^2}} \qquad (5.6.12) $$

With this weight function, the orthogonal polynomials are the Chebyshev
polynomials $\{T_n(x),\ n \ge 0\}$. Thus the integration nodes in (5.6.10) are given by

$$ x_{j,n} = \cos\!\left(\frac{2j - 1}{2n}\,\pi\right) \qquad j = 1, \dots, n \qquad (5.6.13) $$

and from (5.3.11), the weights are

$$ w_{j,n} = \frac{\pi}{n} \qquad j = 1, \dots, n $$

Using the formula (5.3.10) for the error, the Gaussian quadrature formula for
(5.6.12) is given by

$$ \int_{-1}^{1} \frac{f(x)\,dx}{\sqrt{1 - x^2}} = \frac{\pi}{n}\sum_{j=1}^{n} f(x_{j,n}) + \frac{2\pi}{2^{2n}(2n)!}\, f^{(2n)}(\eta) \qquad (5.6.14) $$

for some $-1 < \eta < 1$.

This formula is related to the composite midpoint rule (5.2.18). Make the
change of variable $x = \cos\theta$ in (5.6.14) to obtain

$$ \int_0^{\pi} f(\cos\theta)\,d\theta = \frac{\pi}{n}\sum_{j=1}^{n} f(\cos\theta_{j,n}) + E \qquad (5.6.15) $$

where $\theta_{j,n} = (2j - 1)\pi/2n$. Thus Gaussian quadrature for (5.6.12) is equivalent
to the composite midpoint rule applied to the integral on the left in (5.6.15). Like
the trapezoidal rule, the midpoint rule has an error expansion very similar to that
given in (5.4.9) using the Euler-MacLaurin formula. Corollary 1 to Theorem
5.5 is also valid, showing the composite midpoint rule to be highly accurate for
periodic functions. This is reflected in the high accuracy of (5.6.14). Thus
Gaussian quadrature for (5.6.12) results in a formula that would have been
reasonable from the asymptotic error expansion for the composite midpoint rule
applied to the integral on the left of (5.6.15).
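
Because the nodes and weights are explicit, the rule (5.6.13)-(5.6.14) takes only a few lines to program. The following sketch (the function name `gauss_chebyshev` is an assumption) applies it to $f(x) = x^2$, which is integrated exactly for $n \ge 2$ with value $\pi/2$, and to $f(x) = e^x$, whose weighted integral equals $\pi I_0(1)$ with $I_0$ the modified Bessel function, evaluated here from its power series.

```python
import math

def gauss_chebyshev(f, n):
    """n-point Gaussian quadrature for the weight 1/sqrt(1 - x^2) on [-1, 1]."""
    nodes = (math.cos((2 * j - 1) * math.pi / (2 * n)) for j in range(1, n + 1))
    return (math.pi / n) * sum(f(x) for x in nodes)

# Degree of precision is 2n - 1, so x^2 is integrated exactly for n >= 2.
print(gauss_chebyshev(lambda x: x * x, 4), math.pi / 2)

# Reference value pi * I_0(1), using I_0(1) = sum_k (1/4)^k / (k!)^2.
bessel_i0_at_1 = sum(0.25 ** k / math.factorial(k) ** 2 for k in range(20))
for n in (2, 4, 8):
    err = math.pi * bessel_i0_at_1 - gauss_chebyshev(math.exp, n)
    print(f"n={n}  error {err: .2e}")
```

The same sum is the composite midpoint rule applied to $f(\cos\theta)$ on $[0, \pi]$, as in (5.6.15).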


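Analytic treatment of singularity    Divide the interval of integration into two
parts, one containing the singular point, which is to be treated analytically. For
example, consider

$$ I = \int_0^b f(x)\log(x)\,dx = \left[\int_0^{\epsilon} + \int_{\epsilon}^{b}\right] f(x)\log(x)\,dx \equiv I_1 + I_2 \qquad (5.6.16) $$

Assuming f(x) is smooth on $[\epsilon, b]$, apply a standard technique to the evaluation
of $I_2$. For f(x) about zero, assume it has a convergent Taylor series on $[0, \epsilon]$.
Then

$$ I_1 = \int_0^{\epsilon} f(x)\log(x)\,dx = \int_0^{\epsilon}\left(\sum_{j=0}^{\infty} a_j x^j\right)\log(x)\,dx = \sum_{j=0}^{\infty} a_j\,\frac{\epsilon^{j+1}}{j+1}\left[\log(\epsilon) - \frac{1}{j+1}\right] \qquad (5.6.17) $$

For example, with

$$ I = \int_0^{4\pi} \cos(x)\log(x)\,dx $$

define

$$ I_1 = \int_0^{\epsilon} \cos(x)\log(x)\,dx \qquad \epsilon = .1 $$

$$ I_1 = \epsilon\left[\log(\epsilon) - 1\right] - \frac{\epsilon^3}{6}\left[\log(\epsilon) - \frac{1}{3}\right] + \frac{\epsilon^5}{120}\left[\log(\epsilon) - \frac{1}{5}\right] - \cdots $$

This is an alternating series, and thus it is clear that using the first three terms
will give a very accurate value for $I_1$. A standard method can be applied to $I_2$ on
$[.1, 4\pi]$.

Similar techniques can be used with infinite intervals of integration $[a, \infty)$,
discarding the integral over $[b, \infty)$ for some large value of b. This is not
developed here.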
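
The splitting (5.6.16)-(5.6.17) for the cosine example can be sketched as follows. This is an illustration under stated assumptions: Simpson's rule stands in for the unspecified "standard method" on $[.1, 4\pi]$, and the helper names are hypothetical. As an independent check, integration by parts gives $I = -\mathrm{Si}(4\pi)$, where Si is the sine integral, evaluated here from its power series.

```python
import math

def simpson(g, a, b, n):
    """Composite Simpson rule with n (even) subintervals on [a, b]."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

def singular_part(eps, terms=3):
    """Leading terms of (5.6.17) for f(x) = cos(x), where a_{2k} = (-1)^k / (2k)!."""
    total = 0.0
    for k in range(terms):
        j = 2 * k
        a_j = (-1) ** k / math.factorial(j)
        total += a_j * eps ** (j + 1) / (j + 1) * (math.log(eps) - 1.0 / (j + 1))
    return total

def sine_integral(x, terms=80):
    """Si(x) = sum_k (-1)^k x^(2k+1) / ((2k+1) (2k+1)!), used only as a check."""
    total, term = 0.0, x            # term holds x^(2k+1) / (2k+1)!
    for k in range(terms):
        total += (-1) ** k * term / (2 * k + 1)
        term *= x * x / ((2 * k + 2) * (2 * k + 3))
    return total

eps = 0.1
i1 = singular_part(eps)
i2 = simpson(lambda x: math.cos(x) * math.log(x), eps, 4 * math.pi, 4000)
print("I1 + I2 =", i1 + i2, "  -Si(4 pi) =", -sine_integral(4 * math.pi))
```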


Product integration    Let $I(f) = \int_a^b w(x) f(x)\,dx$ with a near-singular or
singular weight function w(x) and a smooth function f(x). The main idea is to
produce a sequence of functions $f_n(x)$ for which

1. $\|f - f_n\|_{\infty} = \max_{a \le x \le b} |f(x) - f_n(x)| \to 0$ as $n \to \infty$
2. The integrals

$$ I_n(f) = \int_a^b w(x) f_n(x)\,dx \qquad (5.6.18) $$

   can be fairly easily evaluated.

This generalizes the schema (5.0.2) of the introduction. For the error,

$$ |I(f) - I_n(f)| \le \int_a^b |w(x)|\,|f(x) - f_n(x)|\,dx \le \|f - f_n\|_{\infty}\int_a^b |w(x)|\,dx \qquad (5.6.19) $$

Thus $I_n(f) \to I(f)$ as $n \to \infty$, and the rate of convergence is at least as rapid as
that of $f_n(x)$ to f(x) on [a, b].

Within the preceding framework, the product integration methods are usually
defined by using piecewise polynomial interpolation to define $f_n(x)$ from f(x).
To illustrate the main ideas, while keeping the algebra simple, we will define the
product trapezoidal method for evaluating

$$ I(f) = \int_0^b f(x)\log(x)\,dx \qquad (5.6.20) $$

Let $n \ge 1$, h = b/n, $x_j = jh$ for j = 0, 1, ..., n. Define $f_n(x)$ as the piecewise
linear function interpolating to f(x) on the nodes $x_0, x_1, \dots, x_n$. For
$x_{j-1} \le x \le x_j$, define

$$ f_n(x) = \frac{1}{h}\left[(x_j - x) f(x_{j-1}) + (x - x_{j-1}) f(x_j)\right] \qquad (5.6.21) $$

for j = 1, 2, ..., n. From (3.1.10), it is straightforward to show

$$ \|f - f_n\|_{\infty} \le \frac{h^2}{8}\,\|f''\|_{\infty} \qquad (5.6.22) $$

provided f(x) is twice continuously differentiable for $0 \le x \le b$. From (5.6.19)
we obtain the error bound

$$ |I(f) - I_n(f)| \le \frac{h^2}{8}\,\|f''\|_{\infty}\int_0^b |\log(x)|\,dx \qquad (5.6.23) $$

This method of defining $I_n(f)$ is similar to the regular trapezoidal rule (5.1.5).
The rule (5.1.5) could also have been obtained by integrating the preceding
function $f_n(x)$, but the weight function would have been simply $w(x) \equiv 1$. We
can easily generalize the preceding by using higher degree piecewise polynomial
interpolation. The use of piecewise quadratic interpolation to define $f_n(x)$ leads
to a formula $I_n(f)$ called the product Simpson's rule. And using the same
reasoning as led to (5.6.23), it can be shown that

$$ |I(f) - I_n(f)| \le \frac{\sqrt{3}\,h^3}{27}\,\|f'''\|_{\infty}\int_0^b |\log(x)|\,dx \qquad (5.6.24) $$

Higher order formulas can be obtained by using even higher degree interpolation.
For the computation of $I_n(f)$ using (5.6.21),

$$ I_n(f) = \sum_{j=1}^{n}\int_{x_{j-1}}^{x_j} \log(x)\,\frac{1}{h}\left[(x_j - x) f(x_{j-1}) + (x - x_{j-1}) f(x_j)\right] dx = \sum_{k=0}^{n} w_k f(x_k) \qquad (5.6.25) $$

$$ w_0 = \frac{1}{h}\int_{x_0}^{x_1} (x_1 - x)\log(x)\,dx \qquad w_n = \frac{1}{h}\int_{x_{n-1}}^{x_n} (x - x_{n-1})\log(x)\,dx $$

$$ w_j = \frac{1}{h}\int_{x_{j-1}}^{x_j} (x - x_{j-1})\log(x)\,dx + \frac{1}{h}\int_{x_j}^{x_{j+1}} (x_{j+1} - x)\log(x)\,dx \qquad j = 1, \dots, n-1 \qquad (5.6.26) $$

The calculation of these weights can be simplified considerably. Making the
change of variable $x - x_{j-1} = uh$, $0 \le u \le 1$, we have

$$ \frac{1}{h}\int_{x_{j-1}}^{x_j} (x - x_{j-1})\log(x)\,dx = h\int_0^1 u\,\log[(j - 1 + u)h]\,du = \frac{h}{2}\log(h) + h\int_0^1 u\,\log(j - 1 + u)\,du $$

and

$$ \frac{1}{h}\int_{x_{j-1}}^{x_j} (x_j - x)\log(x)\,dx = h\int_0^1 (1 - u)\log[(j - 1 + u)h]\,du = \frac{h}{2}\log(h) + h\int_0^1 (1 - u)\log(j - 1 + u)\,du $$

Define

$$ \psi_1(k) = \int_0^1 u\,\log(u + k)\,du \qquad \psi_2(k) = \int_0^1 (1 - u)\log(u + k)\,du \qquad (5.6.27) $$

for k = 0, 1, 2, .... Then

$$ w_0 = \frac{h}{2}\log(h) + h\,\psi_2(0) \qquad w_n = \frac{h}{2}\log(h) + h\,\psi_1(n - 1) $$

$$ w_j = h\log(h) + h\left[\psi_1(j - 1) + \psi_2(j)\right] \qquad j = 1, 2, \dots, n-1 \qquad (5.6.28) $$

The functions $\psi_1(k)$ and $\psi_2(k)$ do not depend on h, b, or n. They can be
calculated and stored in a table for use with a variety of values of b and n. For
example, a table of $\psi_1(k)$ and $\psi_2(k)$ for k = 0, 1, ..., 99 can be used with any
b > 0 and with any $n \le 100$. Once the table of the values $\psi_1(k)$ and $\psi_2(k)$ has
been calculated, the cost of using the product trapezoidal rule is no greater than
the cost of any other integration rule.

The integrals $\psi_1(k)$ and $\psi_2(k)$ in (5.6.27) can be evaluated explicitly; some
values are given in Table 5.23.


Table 5.23    Weights for product trapezoidal rule

k    psi_1(k)         psi_2(k)
0    -.250            -.750
1     .250             .1362943611
2     .4883759281      .4211665768
3     .6485778545      .6007627239
4     .7695705457      .7324415720
5     .8668602747      .8365069785
6     .9482428376      .9225713904
7    1.018201652       .9959596385
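
The product trapezoidal rule (5.6.25)-(5.6.28) can be sketched directly from these formulas. In the code below, the closed forms used for $\psi_1(k)$ and $\psi_2(k)$ come from integrating (5.6.27) by parts; they and the function names are assumptions of this sketch rather than formulas quoted from the text, although they agree with the tabulated values for small k. The test integrand is $f(x) = 1/(x + 2)$ on [0, 1], for which $I \doteq -.4484137$.

```python
import math

def psi1(k):
    """psi_1(k) = integral over [0, 1] of u log(u + k), in closed form."""
    tail = 0.5 * k * k * math.log(k) if k > 0 else 0.0
    return 0.5 * (1 - k * k) * math.log(1 + k) + tail - 0.25 + 0.5 * k

def psi2(k):
    """psi_2(k) = integral of (1 - u) log(u + k) = integral of log(u + k) minus psi_1(k)."""
    tail = k * math.log(k) if k > 0 else 0.0
    return (1 + k) * math.log(1 + k) - tail - 1.0 - psi1(k)

def product_trapezoid(f, b, n):
    """Approximate the integral of f(x) log(x) over [0, b] using (5.6.25) and (5.6.28)."""
    h = b / n
    logh = math.log(h)
    total = (0.5 * h * logh + h * psi2(0)) * f(0.0)        # w_0 f(x_0)
    total += (0.5 * h * logh + h * psi1(n - 1)) * f(b)     # w_n f(x_n)
    for j in range(1, n):
        w_j = h * logh + h * (psi1(j - 1) + psi2(j))
        total += w_j * f(j * h)
    return total

exact = -0.4484137
for n in (1, 2, 4, 8, 16):
    approx = product_trapezoid(lambda x: 1.0 / (x + 2.0), 1.0, n)
    print(f"n={n:2d}  I_n = {approx:.7f}  I - I_n = {exact - approx:.2e}")
```

For large k the closed forms subtract nearly equal quantities, so a production code would evaluate or tabulate $\psi_1$ and $\psi_2$ more carefully; the sketch is intended for small n only.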


Example    Compute $I = \int_0^1 (1/(x + 2))\log(x)\,dx \doteq -.4484137$. The computed
values are given in Table 5.24. The computed rate of convergence is in agreement
with the order of convergence of (5.6.23).


Table 5.24    Example of product trapezoidal rule

n    I_n          I - I_n     Ratio
1    -.4583333    .00992
2    -.4516096    .00320      3.10
4    -.4493011    .000887     3.61
8    -.4486460    .000232     3.82


Many types of interpolation may be used to define $f_n(x)$, but most
applications to date have used piecewise polynomial interpolation on evenly spaced
node points. Other weight functions may be used, for example,

$$ w(x) = x^{\alpha} \qquad \alpha > -1 \qquad x \ge 0 \qquad (5.6.29) $$

and again the weights can be reduced to a fairly simple formula similar to
(5.6.28). For an irrational value of $\alpha$ in (5.6.29),
a change of variables can no longer be used to remove the singularity in the 
integral. Also, one of the major applications of product integration is to integral 
equations in which the kernel function has an algebraic and/or logarithmic 
singularity. For such equations, changes of variables are no longer possible, even 
with square root singularities. For example, consider the equation 


$$ \lambda\varphi(x) - \int_a^b \frac{\varphi(y)\,dy}{|x - y|^{1/2}} = f(x) \qquad a \le x \le b $$


with $\lambda$, a, b, and f given and $\varphi$ the desired unknown function. Product
integration leads to efficient procedures for such equations, provided $\varphi(y)$ is a
smooth function [see Atkinson (1976), p. 106].

For complicated weight functions in which the weights $w_j$ can no longer be
calculated, it is often possible to modify the problem to one in which product 
integration is still easily applicable. This will be examined using an example. 


Example    Consider $I = \int_0^{\pi} f(x)\log(\sin x)\,dx$. The integrand has a singularity at
both x = 0 and $x = \pi$. Use

$$ \log(\sin x) = \log\!\left[\frac{\sin(x)}{x(\pi - x)}\right] + \log(x) + \log(\pi - x) $$

and this gives

$$ I = \int_0^{\pi} f(x)\log\!\left[\frac{\sin(x)}{x(\pi - x)}\right] dx + \int_0^{\pi} f(x)\log(x)\,dx + \int_0^{\pi} f(x)\log(\pi - x)\,dx \equiv I_1 + I_2 + I_3 \qquad (5.6.30) $$

Integral $I_1$ has an infinitely differentiable integrand, and any standard numerical
method will perform well. Integral $I_2$ has already been discussed, with w(x) =
log(x). For $I_3$, use a change of variable to write

$$ I_3 = \int_0^{\pi} f(x)\log(\pi - x)\,dx = \int_0^{\pi} f(\pi - z)\log(z)\,dz $$

Combining with $I_2$,

$$ I_2 + I_3 = \int_0^{\pi} \log(x)\left[f(x) + f(\pi - x)\right] dx $$

to which the preceding work applies. By such manipulations, the applicability of
the cases w(x) = log(x) and $w(x) = x^{\alpha}$ is much greater than might first be
imagined.
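
The decomposition (5.6.30) can be checked on the simplest case $f(x) \equiv 1$, for which $\int_0^{\pi} \log(\sin x)\,dx = -\pi\log 2$. In the sketch below, which is illustrative only, $I_1$ is computed by Simpson's rule, with the limiting value $\log(1/\pi)$ supplied at the two endpoints, and $I_2 + I_3 = 2\int_0^{\pi}\log(x)\,dx$ is evaluated exactly instead of by product integration.

```python
import math

def simpson(g, a, b, n):
    """Composite Simpson rule with n (even) subintervals on [a, b]."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

def smooth_part(x):
    """log( sin(x) / (x (pi - x)) ), using the limit log(1/pi) at x = 0 and x = pi."""
    if x <= 0.0 or x >= math.pi:
        return -math.log(math.pi)
    return math.log(math.sin(x) / (x * (math.pi - x)))

i1 = simpson(smooth_part, 0.0, math.pi, 200)
i23 = 2.0 * math.pi * (math.log(math.pi) - 1.0)   # f = 1: twice the integral of log(x) on [0, pi]
total = i1 + i23
print(total, -math.pi * math.log(2.0))
```

For a general smooth f, the exact value used for $I_2 + I_3$ would be replaced by a product integration rule applied to $f(x) + f(\pi - x)$.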


For an asymptotic error analysis of product integration, see the work of
de Hoog and Weiss (1973), in which some generalizations of the Euler-MacLaurin
expansion are derived. Using their results, it can be shown that the
error in the product Simpson rule is $O(h^4 \log(h))$. Thus the bound (5.6.24) based
on the interpolation error $f(x) - f_n(x)$ does not predict the correct rate of
convergence. This is similar to the result (5.1.17) for the Simpson rule error, in
which the error was smaller than the use of quadratic interpolation would lead us
to believe.


5.7 Numerical Differentiation 


Numerical approximations to derivatives are used mainly in two ways. First, we
are interested in calculating derivatives of given data that are often obtained
empirically. Second, numerical differentiation formulas are used in deriving
numerical methods for solving ordinary and partial differential equations. We
begin this section by deriving some of the most commonly used formulas for
numerical differentiation.

The problem of numerical differentiation is in some ways more difficult than
that of numerical integration. When using empirically determined function
values, the error in these values will usually lead to instability in the numerical
differentiation of the function. In contrast, numerical integration is stable when
faced with such errors (see Problem 13).


The classical formulas    One of the main approaches to deriving a numerical
approximation to $f'(x)$ is to use the derivative of a polynomial $p_n(x)$ that
interpolates f(x) at a given set of node points. Let $x_0, x_1, \dots, x_n$ be given, and
let $p_n(x)$ interpolate f(x) at these nodes. Usually $\{x_i\}$ are evenly spaced. Then
use

$$ f'(x) \approx p_n'(x) \qquad (5.7.1) $$

From (3.1.6), (3.2.4), and (3.2.11):

$$ p_n(x) = \sum_{j=0}^{n} f(x_j)\,\ell_j(x) $$

$$ \ell_j(x) = \frac{\Psi_n(x)}{(x - x_j)\,\Psi_n'(x_j)} = \frac{(x - x_0)\cdots(x - x_{j-1})(x - x_{j+1})\cdots(x - x_n)}{(x_j - x_0)\cdots(x_j - x_{j-1})(x_j - x_{j+1})\cdots(x_j - x_n)} $$

$$ \Psi_n(x) = (x - x_0)\cdots(x - x_n) $$

$$ f(x) - p_n(x) = \Psi_n(x)\, f[x_0, \dots, x_n, x] \qquad (5.7.2) $$

Thus

$$ f'(x) \approx p_n'(x) = \sum_{j=0}^{n} f(x_j)\,\ell_j'(x) \equiv D_h f(x) \qquad (5.7.3) $$

$$ f'(x) - D_h f(x) = \Psi_n'(x)\, f[x_0, \dots, x_n, x] + \Psi_n(x)\, f[x_0, \dots, x_n, x, x] \qquad (5.7.4) $$

with the last step using (3.2.17). Applying (3.2.12),

$$ f'(x) - D_h f(x) = \Psi_n'(x)\,\frac{f^{(n+1)}(\xi_1)}{(n+1)!} + \Psi_n(x)\,\frac{f^{(n+2)}(\xi_2)}{(n+2)!} \qquad (5.7.5) $$

with $\xi_1$, $\xi_2$ in the smallest interval containing $x_0, \dots, x_n, x$. Higher order
differentiation formulas and their error can be obtained by further differentiation
of (5.7.3) and (5.7.4).

The most common application of the preceding is to evenly spaced nodes
$\{x_i\}$. Thus let

$$ x_i = x_0 + ih \qquad i \ge 0 $$

with h > 0. In this case, it is straightforward to show that

$$ \Psi_n(x) = O(h^{n+1}) \qquad \Psi_n'(x) = O(h^n) \qquad (5.7.6) $$

Thus

$$ f'(x) - p_n'(x) = \begin{cases} O(h^n) & \Psi_n'(x) \ne 0 \\ O(h^{n+1}) & \Psi_n'(x) = 0 \end{cases} \qquad (5.7.7) $$

We now derive examples of each case.
Let n = 1, so that $p_1(x)$ is just the linear interpolate of $(x_0, f(x_0))$ and
$(x_1, f(x_1))$. Then (5.7.3) yields

$$ f'(x_0) \approx D_h f(x_0) = \frac{1}{h}\left[f(x_0 + h) - f(x_0)\right] \qquad (5.7.8) $$

From (5.7.5),

$$ f'(x_0) - D_h f(x_0) = -\frac{h}{2}\, f''(\xi_1) \qquad x_0 \le \xi_1 \le x_1 \qquad (5.7.9) $$

since $\Psi_1(x_0) = 0$.

To improve on this with linear interpolation, choose $x = m \equiv (x_0 + x_1)/2$.
Then

$$ f'(m) \approx \frac{1}{h}\left[f(x_1) - f(x_0)\right] $$

We usually rewrite this by letting $\delta = h/2$, to obtain

$$ f'(m) \approx D_{\delta} f(m) = \frac{1}{2\delta}\left[f(m + \delta) - f(m - \delta)\right] \qquad (5.7.10) $$

For the error, using (5.7.5) and $\Psi_1'(m) = 0$,

$$ f'(m) - D_{\delta} f(m) = -\frac{\delta^2}{6}\, f^{(3)}(\xi_2) \qquad m - \delta \le \xi_2 \le m + \delta \qquad (5.7.11) $$

In general, to obtain the higher order case in (5.7.7), we want to choose the nodes
$\{x_i\}$ to have $\Psi_n'(x) = 0$. This will be true if n is odd and the nodes are placed
symmetrically about x, as in (5.7.10).

To obtain higher order formulas in which the nodes all lie on one side of x,
use higher values of n in (5.7.3). For example, with $x = x_0$ and n = 2,

$$ f'(x_0) \approx D_h f(x_0) = \frac{1}{2h}\left[-3f(x_0) + 4f(x_1) - f(x_2)\right] \qquad (5.7.12) $$

$$ f'(x_0) - D_h f(x_0) = \frac{h^2}{3}\, f^{(3)}(\xi) \qquad x_0 \le \xi \le x_2 \qquad (5.7.13) $$
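
The orders of accuracy in (5.7.9), (5.7.11), and (5.7.13) can be observed numerically. The following sketch uses the illustrative choice $f(x) = e^x$ at x = 0, where the exact derivative is 1; the function names are assumptions of this sketch. Halving h should roughly halve the error of (5.7.8) and reduce the errors of (5.7.10) and (5.7.12) by about a factor of 4.

```python
import math

def forward_difference(f, x, h):
    """Formula (5.7.8): error -(h/2) f''(xi)."""
    return (f(x + h) - f(x)) / h

def central_difference(f, x, delta):
    """Formula (5.7.10): error -(delta^2/6) f'''(xi)."""
    return (f(x + delta) - f(x - delta)) / (2 * delta)

def one_sided_three_point(f, x, h):
    """Formula (5.7.12): error (h^2/3) f'''(xi)."""
    return (-3 * f(x) + 4 * f(x + h) - f(x + 2 * h)) / (2 * h)

exact = 1.0   # derivative of exp at x = 0
for h in (0.1, 0.05, 0.025):
    e_fwd = exact - forward_difference(math.exp, 0.0, h)
    e_cen = exact - central_difference(math.exp, 0.0, h)
    e_one = exact - one_sided_three_point(math.exp, 0.0, h)
    print(f"h={h:<6} forward {e_fwd: .3e}  central {e_cen: .3e}  one-sided {e_one: .3e}")
```

The signs of the computed errors also match the formulas: negative for (5.7.8) and (5.7.10), positive for (5.7.12), since all derivatives of $e^x$ are positive.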


The method of undetermined coefficients    Another method to derive formulas
for numerical integration, differentiation, and interpolation is called the method
of undetermined coefficients. It is often equivalent to the formulas obtained from
a polynomial interpolation formula, but sometimes it results in a simpler
derivation. We will illustrate the method by deriving a formula for $f''(x)$.

Assume

$$ f''(x) \approx D_h^{(2)} f(x) = A f(x + h) + B f(x) + C f(x - h) \qquad (5.7.14) $$

with A, B, and C unspecified. Replace f(x + h) and f(x - h) by the Taylor
expansions

$$ f(x \pm h) = f(x) \pm h f'(x) + \frac{h^2}{2} f''(x) \pm \frac{h^3}{6} f^{(3)}(x) + \frac{h^4}{24} f^{(4)}(\xi_{\pm}) $$

with $x - h \le \xi_- \le x \le \xi_+ \le x + h$. Substitute into (5.7.14) and rearrange
into a polynomial in powers of h:

$$ A f(x + h) + B f(x) + C f(x - h) = (A + B + C) f(x) + h(A - C) f'(x) + \frac{h^2}{2}(A + C) f''(x) + \frac{h^3}{6}(A - C) f^{(3)}(x) + \frac{h^4}{24}\left[A f^{(4)}(\xi_+) + C f^{(4)}(\xi_-)\right] \qquad (5.7.15) $$

In order for this to equal $f''(x)$, we set

$$ A + B + C = 0 \qquad A - C = 0 \qquad A + C = \frac{2}{h^2} $$

The solution of this system is

$$ A = C = \frac{1}{h^2} \qquad B = \frac{-2}{h^2} \qquad (5.7.16) $$

This yields the formula

$$ D_h^{(2)} f(x) = \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} \qquad (5.7.17) $$

For the error, substitute (5.7.16) into (5.7.15) and use (5.7.17). This yields

$$ f''(x) - D_h^{(2)} f(x) = -\frac{h^2}{24}\left[f^{(4)}(\xi_+) + f^{(4)}(\xi_-)\right] $$

Using Problem 1 of Chapter 1, and assuming f(x) is four times continuously
differentiable,

$$ f''(x) - D_h^{(2)} f(x) = -\frac{h^2}{12}\, f^{(4)}(\xi) \qquad (5.7.18) $$

for some $x - h \le \xi \le x + h$. Formulas (5.7.17) and (5.7.18) could have been
derived by calculating $p_2''(x)$ for the quadratic polynomial interpolating f(x) at
x - h, x, x + h, but the preceding is probably simpler.

The general idea of the method of undetermined coefficients is to choose the
Taylor coefficients in an expansion in h so as to obtain the desired derivative (or
integral) as closely as possible.
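
The linear system for A, B, and C can also be set up and solved mechanically. The sketch below is a generalization assumed for illustration, not a procedure given in the text: for nodes $x + s_j h$ it imposes the Taylor conditions $\sum_j c_j s_j^k = m!\,\delta_{km}$, $k = 0, \dots, N - 1$, and solves them exactly in rational arithmetic. The name `fd_weights` is hypothetical, and the returned coefficients must be divided by $h^m$.

```python
from fractions import Fraction
from math import factorial

def fd_weights(offsets, m):
    """Coefficients c_j with sum_j c_j f(x + s_j h) / h^m approximating f^(m)(x).

    Row k of the system matches the coefficient of h^k f^(k)(x) in the Taylor
    expansion: sum_j c_j s_j^k = m! if k == m, and 0 otherwise.
    """
    n = len(offsets)
    rows = [[Fraction(s) ** k for s in offsets]
            + [Fraction(factorial(m) if k == m else 0)]
            for k in range(n)]
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        piv = rows[col][col]
        rows[col] = [v / piv for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

# Nodes x + h, x, x - h and the second derivative: reproduces (5.7.16) times h^2.
print(fd_weights([1, 0, -1], 2))
# Nodes x, x + h, x + 2h and the first derivative: reproduces (5.7.12) times h.
print(fd_weights([0, 1, 2], 1))
```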


Effect of error in function values    The preceding formulas are useful when
deriving methods for solving ordinary and partial differential equations, but they
can lead to serious errors when applied to function values that are obtained
empirically. To illustrate a method for analyzing the effect of such errors, we
consider the second derivative approximation (5.7.17).

Begin by rewriting (5.7.17) as

$$ D_h^{(2)} f(x_1) = \frac{f(x_2) - 2f(x_1) + f(x_0)}{h^2} $$

with $x_j = x_0 + jh$. Let the actual values used be $\tilde{f}_j$ with

$$ f(x_j) = \tilde{f}_j + \epsilon_j \qquad j = 0, 1, 2 \qquad (5.7.19) $$

The actual numerical derivative computed is

$$ \tilde{D}_h^{(2)} f(x_1) = \frac{\tilde{f}_2 - 2\tilde{f}_1 + \tilde{f}_0}{h^2} \qquad (5.7.20) $$

For its error, substitute (5.7.19) into (5.7.20), obtaining

$$ f''(x_1) - \tilde{D}_h^{(2)} f(x_1) = f''(x_1) - \frac{f(x_2) - 2f(x_1) + f(x_0)}{h^2} + \frac{\epsilon_2 - 2\epsilon_1 + \epsilon_0}{h^2} = -\frac{h^2}{12}\, f^{(4)}(\xi) + \frac{\epsilon_2 - 2\epsilon_1 + \epsilon_0}{h^2} \qquad (5.7.21) $$

For the term involving $\{\epsilon_j\}$, assume these errors are random within some
interval $-E \le \epsilon_j \le E$. Then

$$ \left|f''(x_1) - \tilde{D}_h^{(2)} f(x_1)\right| \le \frac{h^2}{12}\left|f^{(4)}(\xi)\right| + \frac{4E}{h^2} \qquad (5.7.22) $$

and the last bound would be attainable in many situations. An example of such
errors would be rounding errors, with E a bound on their magnitude.

The error bound in (5.7.22) will initially get smaller as h decreases, but for h
sufficiently close to zero, the error will begin to increase again. There is an
optimal value of h, call it h*, to minimize the right side of (5.7.22), and
presumably there is a similar value for the actual error $f''(x_1) - \tilde{D}_h^{(2)} f(x_1)$.


Example    Let f(x) = -cos(x), and compute $f''(0)$ using the numerical
approximation (5.7.17). In Table 5.25, we give the errors in (1) $D_h^{(2)} f(0)$, computed
exactly, and (2) $\tilde{D}_h^{(2)} f(0)$, computed using 8-digit rounded decimal arithmetic. In
this last case,

$$ \left|f''(0) - \tilde{D}_h^{(2)} f(0)\right| \le \frac{h^2}{12} + \frac{2 \times 10^{-8}}{h^2} \qquad (5.7.23) $$

Setting the derivative of this bound to zero gives $h^4 = 2.4 \times 10^{-7}$, so the bound
is minimized at $h^* \doteq .022$, which is consistent with the errors
$f''(0) - \tilde{D}_h^{(2)} f(0)$ given in the table. For the exactly computed $D_h^{(2)} f(0)$, note that
the errors decrease by four whenever h is halved, consistent with the error
formula (5.7.18).


Table 5.25    Example of $D_h^{(2)} f(0)$ and $\tilde{D}_h^{(2)} f(0)$

h             f''(0) - D f(0)    Ratio    f''(0) - D~ f(0)
.5            2.07E-2                      2.07E-2
.25           5.20E-3            3.98      5.20E-3
.125          1.30E-3            3.99      1.30E-3
.0625         3.25E-4            4.00      3.25E-4
.03125        8.14E-5            4.00      8.45E-5
.015625       2.03E-5            4.00      2.56E-6
.0078125      5.09E-6            4.00     -7.94E-5
.00390625     1.27E-6            4.00     -7.94E-5
.001953125    3.18E-7            4.00     -1.39E-3
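
The interplay of truncation and rounding error in this example can be imitated as follows. This sketch makes two assumptions that should be kept in mind: "exact" values are ordinary double precision values, and 8-digit rounded decimal arithmetic is simulated by rounding the three function values to 8 decimal places (the same as 8 significant digits here, since the values lie between .1 and 1 in magnitude) and then forming the difference quotient in double precision.

```python
import math

def second_difference(values, h):
    """Formula (5.7.17) applied to the values (f(x - h), f(x), f(x + h))."""
    f0, f1, f2 = values
    return (f2 - 2 * f1 + f0) / (h * h)

def f(x):
    return -math.cos(x)

def errors(h, digits=8):
    """Errors f''(0) - D f(0) using unrounded and rounded function values."""
    unrounded = [f(-h), f(0.0), f(h)]
    rounded = [round(v, digits) for v in unrounded]
    exact = 1.0                          # f''(0) = cos(0)
    return exact - second_difference(unrounded, h), exact - second_difference(rounded, h)

for k in range(1, 10):
    h = 2.0 ** (-k)
    e_exact, e_rounded = errors(h)
    print(f"h={h:<12} {e_exact: .2e}   {e_rounded: .2e}")

# Minimizing h^2/12 + 4E/h^2 with E = .5E-8 gives h^4 = 48 E.
E = 0.5e-8
h_star = (48 * E) ** 0.25
print("h* =", h_star)
```

The rounded column stops improving near the minimizing h and then deteriorates as h decreases further, which is the behavior predicted by the bound (5.7.22).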


Discussion of the Literature 


Even though the topic of numerical integration is one of the oldest in numerical 
analysis and there is a very large literature, new papers continue to appear at a 
fairly high rate. Many of these results give methods for special classes of 
problems, for example, oscillatory integrals, and others are a response to changes 
in computers, for example, the use of vector pipeline architectures. The best 
survey of numerical integration is the large and detailed work of Davis and 
Rabinowitz (1984). It contains a comprehensive survey of most quadrature 
methods, a very extensive bibliography, a set of computer programs, and a 
bibliography of published quadrature programs. It also contains the article “On 
the practical evaluation of integrals” by Abramowitz, which gives some excellent 
suggestions on analytic approaches to quadrature. Other important texts in 
numerical integration are Engels (1980), Krylov (1962), and Stroud (1971). For a 
history of the classical numerical integration methods, see Goldstine (1977). 

For reasons of space, we have had to omit some important ideas. Chief among 
these are (1) Clenshaw-Curtis quadrature, and (2) multivariable quadrature. The 
former is based on integrating a Chebyshev expansion of the integrand; em- 
pirically the method has proved excellent for a wide variety of integrals. The 
original method is presented in Clenshaw and Curtis (1960); a current account of
the method is given in Piessens et al. (1983, pp. 28-39). The area of multivariable 
quadrature is an active area of research, and the texts of Engels (1980) and 
Stroud (1971) are the best introductions to the area. Because of the widespread 
use of multivariable quadrature in the finite element method for solving partial 
differential equations, texts on the finite element method will often contain 
integration formulas for triangular and rectangular regions. 

Automatic numerical integration was a very active area of research in the 
1960s and 1970s, when it was felt that most numerical integrations could be done 
in this way. Recently, there has been a return to a greater use of nonautomatic 
quadrature especially adapted to the integral at hand. An excellent discussion of 
the relative advantages and disadvantages of automatic quadrature is given in 
Lyness (1983). The most powerful and flexible of the current automatic programs 
are probably those given in QUADPACK, which is discussed and illustrated in 
Piessens et al. (1983). Versions of QUADPACK are included in the IMSL and 
NAG libraries. 
For microcomputers and hand computation, Simpson’s rule is still popular
because of its simplicity. Nonetheless, serious consideration should be given to 
Gaussian quadrature because of its much greater accuracy. The nodes and 
weights are readily available, in Abramowitz and Stegun (1964) and Stroud and 
Secrest (1966), and programs for their calculation are also available. 

Numerical differentiation is an ill-posed problem in the sense of Section 1.6. 
Numerical differentiation procedures that account for this have been developed 
in the past ten to fifteen years. In particular, see Anderssen and Bloomfield 
(1974a), (1974b), Cullum (1971), Wahba (1980), and Woltring (1986). 
Problems


1.   Write a program to evaluate $I = \int_a^b f(x)\,dx$ using the trapezoidal rule with
     n subdivisions, calling the result $I_n$. Use the program to calculate the
     following integrals with n = 2, 4, 8, 16, ..., 512.

     (a) $\int_0^1 e^{-x^2}\,dx$     (b) $\int_0^1 x^{5/2}\,dx$     (c) $\int_{-4}^{4} \frac{dx}{1 + x^2}$

     (d) $\int_0^{2\pi} \frac{dx}{2 + \cos(x)}$     (e) $\int_0^{\pi} e^{x}\cos(4x)\,dx$


     Analyze empirically the rate of convergence of $I_n$ to I by calculating
     the ratios of (5.4.27):

     $$ R_n = \frac{I_{2n} - I_n}{I_{4n} - I_{2n}} $$

2.   Repeat Problem 1 using Simpson’s rule.

3.   Apply the corrected trapezoidal rule (5.1.12) to the integrals in Problem 1.
     Compare the results with those of Problem 2 for Simpson’s rule.
4.   As another approach to the corrected trapezoidal rule (5.1.12), use the cubic
     Hermite interpolation polynomial to f(x) to obtain

     $$ \int_a^b f(x)\,dx \approx \left(\frac{b - a}{2}\right)\left[f(a) + f(b)\right] - \frac{(b - a)^2}{12}\left[f'(b) - f'(a)\right] $$

     Use the error formula for Hermite interpolation to obtain an error formula
     for the preceding approximation. Generalize these results to (5.1.12), with n
     subdivisions of [a, b].

5.   (a) Assume that f(x) is continuous and that $f'(x)$ is integrable on [0, 1].
         Show that the error in the trapezoidal rule for calculating $\int_0^1 f(x)\,dx$
         has the form

         $$ E_n(f) = \int_0^1 K(t) f'(t)\,dt $$

         $$ K(t) = \frac{t_{j-1} + t_j}{2} - t \qquad t_{j-1} \le t \le t_j \qquad j = 1, \dots, n $$

         This contrasts with (5.1.22) in which $f'(x)$ is continuous and $f''(x)$ is
         integrable.

     (b) Apply the result to $f(x) = x^{\alpha}$ and to $f(x) = x^{\alpha}\log(x)$, $0 < \alpha < 1$.
         This gives an order of convergence, although it is less than the true
         order. [See Problem 6 and (5.4.23).]

6.   Using the program of Problem 1, determine empirically the rate of
     convergence of the trapezoidal rule applied to $\int_0^1 x^{\alpha}\ln(x)\,dx$, $0 \le \alpha \le 1$,
     with a range of values of $\alpha$, say $\alpha$ = .25, .5, .75, 1.0.

7.   Derive the composite form of Boole’s rule, which is given as the n = 4 entry
     in Table 5.8. Develop error formulas analogous to those given in (5.1.17)
     and (5.1.18) for Simpson’s rule.

8.   Repeat Problem 1 using the composite Boole’s rule obtained in Problem 7.

9.   Let $p_2(x)$ be the quadratic polynomial interpolating f(x) at x = 0, h, 2h.
     Use this to derive a numerical integration formula $I_h$ for $I = \int_0^{3h} f(x)\,dx$.
     Use a Taylor series expansion of f(x) to show

     $$ I - I_h = \frac{3}{8}\,h^4 f'''(0) + O(h^5) $$
10.  For the midpoint integration formula (5.2.17), derive the Peano kernel error
     formula

     $$ \int_0^h f(x)\,dx - h f\!\left(\frac{h}{2}\right) = \int_0^h K(t) f''(t)\,dt $$

     $$ K(t) = \begin{cases} \tfrac{1}{2}t^2 & 0 \le t \le \tfrac{h}{2} \\ \tfrac{1}{2}(h - t)^2 & \tfrac{h}{2} \le t \le h \end{cases} $$

     Use this to derive the error term in (5.2.17).

11.  Consider functions defined in the following way. Let n > 0, h = (b - a)/n,
     $t_j = a + jh$ for j = 0, 1, ..., n. Let f(x) be linear on each subinterval
     $[t_{j-1}, t_j]$, for j = 1, ..., n. Show that the set of all such f, for all $n \ge 1$, is
     dense in C[a, b].

12.  Let w(x) be an integrable function on [a, b],

     $$ \int_a^b |w(x)|\,dx < \infty $$

     and let

     $$ \int_a^b w(x) f(x)\,dx \approx \sum_{j=1}^{n} w_{j,n} f(x_{j,n}) $$

     be a sequence of numerical integration rules. Generalize Theorem 5.2 to
     this case.

13.  Assume the numerical integration rule

     $$ \int_a^b f(x)\,dx \approx \sum_{j=1}^{n} w_{j,n} f(x_{j,n}) $$

     is convergent for all continuous functions. Consider the effect of errors in
     the function values. Assume we use $\tilde{f}_j \approx f(x_{j,n})$, with

     $$ \left|f(x_{j,n}) - \tilde{f}_j\right| \le \epsilon \qquad 1 \le j \le n $$

     What is the effect on the numerical integration of these errors in the
     function values?

14.  Apply Gauss-Legendre quadrature to the integrals in Problem 1. Compare
     the results with those for the trapezoidal and Simpson methods.

15.  Use Gauss-Legendre quadrature to evaluate $\int_{-4}^{4} dx/(1 + x^2)$ with the
     n = 2, 4, 6, 8 node-point formulas. Compare with the results in Table 5.9,
     obtained using the Newton-Cotes formula.
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16. 


17. 


18. 


19. 
20. 


21. 


22. 


Prove that the Gauss— Legendre nodes and weights on [—1, 1] are peace 
cally distributed about the origin x = 0. 


Derive the two-point Gaussian quadrature formula for 


if) = [terre dx 


in which the weight function is w(x) = log(1/x). Hint: See Problem 20(a) 
of Chapter 4. Also, use the analogue of (5.3.7) to compute the weights, not 
formula (5.3.11). 


Derive the one- and two-point Gaussian quadrature formulas for 
1 . a. & 
I= foxf(x)dx = D ws (x;,) 
0 j=l 


with weight function w(x) = x. 


For the integral J = {1,V1 — x?f(x) dx with weight w(x) = yl — x’, 
find explicit formulas for the nodes and weights of the Gaussian quadrature 
formula. Also give the error formula. Hint: See Problem 24 of Chapter 4. 


Using the column e,, of Table 5.13, produce the fourth-order error 
formula analogous to the second-order formula (5.3.40) for e, .. Compare 
it with the fourth-order Simpson error formula (5.1.17). 


The weights in the Kronrod formula (5.3.48) can be calculated as the 
solution of four simultaneous linear equations. Find that system and then 
solve it to verify the values given following (5.3.48). Hint: Use the approach 
leading to (5.3.7). 

Compare the seven-point Gauss—Legendre formula with the seven-point 
Kronrod formula of (5.3.48). Use each of them on a variety of integrals, 
and then compare their respective errors. 


(a) Derive the relation (5.4.7) for the Bernoulli polynomials B,(x) and 
Bernoulli numbers B,. Show that B; = 0 for all odd integers j > 3. 


(b) Derive the identities 
B(x) =jB,_,(x) j= 4andeven 
Bi(x) =j[Ba(x) + By] 7 = 3 and odd 


These can be used to give a general proof of the Euler-MacLaurin 
formula (5.4.9). 


24. Using the Euler-MacLaurin summation formula (5.4.17), obtain an estimate
of $\zeta(3/2)$, accurate to three decimal places. The zeta function $\zeta(p)$ is
defined in (5.4.22).

25. Obtain the asymptotic error formula for the trapezoidal rule applied to
$\int_0^1 \sqrt{x}\, f(x)\,dx$. Use the estimate from Problem 24.

26. Consider the following table of approximate integrals $I_n$ produced using
Simpson's rule. Predict the order of convergence of $I_n$ to $I$:

     n     I_n
     2     .28451779686
     4     .28559254576
     8     .28570248748
    16     .28571317731
    32     .28571418363
    64     .28571427643

That is, if $I - I_n \approx c/n^p$, then what is p? Does this appear to be a valid
form for the error for these data? Predict a value of c and the error in $I_{64}$.
How large should n be chosen if $I_n$ is to be in error by less than $10^{-11}$?

27. Assume that the error in an integration formula has the asymptotic expansion

$$I - I_n = \frac{C_1}{n\sqrt{n}} + \frac{C_2}{n^2} + \frac{C_3}{n^2\sqrt{n}} + \frac{C_4}{n^3} + \cdots$$

Generalize the Richardson extrapolation process of Section 5.4 to obtain
formulas for $C_1$ and $C_2$. Assume that three values $I_n$, $I_{2n}$, and $I_{4n}$ have
been computed, and use these to compute $C_1$, $C_2$, and an estimate of $I$,
with an error of order $1/(n^2\sqrt{n})$.


28. For the trapezoidal rule (denoted by $I_n^{(T)}$) for evaluating $I = \int_a^b f(x)\,dx$, we
have the asymptotic error formula

$$I - I_n^{(T)} = -\frac{h^2}{12}\,[f'(b) - f'(a)] + O(h^4)$$

and for the midpoint formula $I_n^{(M)}$, we have

$$I - I_n^{(M)} = \frac{h^2}{24}\,[f'(b) - f'(a)] + O(h^4)$$

provided f is sufficiently differentiable on $[a, b]$. Using these results, obtain
a new numerical integration formula $\tilde{I}_n$ combining $I_n^{(T)}$ and $I_n^{(M)}$, with a
higher order of convergence. Write out the weights to the new formula $\tilde{I}_n$.

29. Obtain an asymptotic error formula for Simpson's rule, comparable to the
Euler-MacLaurin formula (5.4.9) for the trapezoidal rule. Use (5.4.9),
(5.4.33), and (5.4.36), as in (5.4.37).

30. Show that the formula (5.4.40) for $I_n^{(2)}$ is the composite Boole's rule. See
Problem 7.

31. Implement the algorithm Romberg of Section 5.4, and then apply it to the
integrals of Problem 1. Compare the results with those for the trapezoidal
and Simpson rules.

32. Consider evaluating $I = \int_0^1 x^\alpha\,dx$ by using the adaptive Simpson's rule that
was described following (5.5.3). To see that the error test (5.5.5) may fail,
consider the case of $-1 < \alpha < 0$, with the integrand arbitrarily set to zero
when $x = 0$. Show that for sufficiently small $\epsilon$, the test (5.5.5) will never be
satisfied for the subinterval $[0, h]$. Note that for $\epsilon$ specified on $[0, 1]$, the
error tolerance for $[0, h]$ will be $\epsilon h$.

33. Use Simpson's rule with even spacing to integrate $I = \int_0^1 \log(x)\,dx$. For
$x = 0$, set the integrand to zero. Compare the results with those in Tables
5.19 and 5.20, for integral $I_1$.

34. Use an adaptive integration program [e.g., DQAGP from Piessens et al.
(1983)] to calculate the integrals in Problem 1. Compare the results with
those of Problems 1, 2, 14, and 31.

35. Decrease the singular behavior of the integrand in

$$I = \int_0^1 f(x) \log(x)\,dx$$

by using the change of variable $x = t^r$, $r > 0$. Analyze the smoothness of
the resulting integrand. Also explore the empirical behavior of the
trapezoidal and Simpson rules for various r.

36. Apply the IMT method, described following (5.6.5), to the calculation of
$I = \int_0^\infty f(x)\,dx$, using some change of variable from $[0, \infty)$ to a finite
interval. Apply this to the calculation of the integrals following (5.6.11).

37. Use Gauss-Laguerre quadrature with n = 2, 4, 6, and 8 node points to
evaluate the following integrals. Use (5.6.11) to put the integrals in proper
form.


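$$\text{(a)}\ \int_0^\infty e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2} \qquad \text{(b)}\ \int_0^\infty \frac{x\,dx}{(1 + x^2)^2} = \frac{1}{2}$$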


38. Evaluate the following singular integrals within the prescribed error
tolerance. Do not use an automatic program.


(a) J" cos(u) log (u) dus « = 1078 
0 
‘3 1 ; 
(b) f" x? sin —) e= 1073 
0 x 


1 cos (x) dx 
[a ae 


39. Use an adaptive program (e.g., DQAGP) to evaluate the integrals of
Problem 38.

40. (a) Develop the product trapezoidal integration rule for calculating

$$I = \int_0^b \frac{f(x)\,dx}{x^\alpha} \qquad 0 < \alpha < 1$$

Put the weights into a convenient form, analogous to (5.6.28) for the
product trapezoidal rule when $w(x) = \log(x)$. Also, give an error
formula analogous to (5.6.24).


(b) Write a simple program to evaluate the following integrals using the
results of part (a).
ar tpl -€ wy fh 
(i) f sie dx (ii) f sin (x2/*) 
Hint: For part (ii), first let u = x?/”. ° 

41. To show the ill-posedness of the differentiation of a function $y(t)$ on an
interval $[0, 1]$, consider calculating $x(t) = y'(t)$ and $x_n(t) = y_n'(t)$, with

$$y_n(t) = y(t) + \frac{1}{n}\,t^n \qquad n \ge 1$$

Recall the definition of ill-posed and well-posed problems from Section 1.6.
Use the preceding construction to show that the evaluation of $x(t)$ is
unstable relative to changes in y. For measuring changes in x and y, use
the maximum norm of (1.1.8) and (4.1.4).

42. Repeat the numerical differentiation example of (5.7.23) and Table 5.25,
using your own computer or hand calculator.


43. Derive error results for $D_h f(m)$ in (5.7.10) relative to the effects of
rounding errors, in analogy with the results (5.7.21)-(5.7.22) for $D_h^{(2)} f(x_1)$.
Apply it to $f(x) = e^x$ at $x = 0$, and compare the results with the actual
results of your own hand computations.




NUMERICAL METHODS 
FOR ORDINARY 
DIFFERENTIAL EQUATIONS 


Differential equations are one of the most important mathematical tools used in 
modeling problems in the physical sciences. In this chapter we derive and analyze 
numerical methods for solving problems for ordinary differential equations. The
main form of problem that we study is the initial value problem:

$$y' = f(x, y), \qquad y(x_0) = Y_0 \tag{6.0.1}$$


The function $f(x, y)$ is to be continuous for all (x, y) in some domain D of the
xy-plane, and $(x_0, Y_0)$ is a point in D. The results obtained for (6.0.1) will
generalize in a straightforward way to both systems of differential equations and 
higher order equations, provided appropriate vector and matrix notation is used. 
These generalizations are discussed and illustrated in the next two sections. 

We say that a function $Y(x)$ is a solution on $[a, b]$ of (6.0.1) if for all
$a \le x \le b$,

1. $(x, Y(x)) \in D$.
2. $Y(x_0) = Y_0$.
3. $Y'(x)$ exists and $Y'(x) = f(x, Y(x))$.

As notation throughout this chapter, $Y(x)$ will denote the true solution of
whatever differential equation problem is being considered.


Example 1. The general first-order linear differential equation is

$$y' = a_0(x)\,y + g(x), \qquad a \le x \le b$$

in which the coefficients $a_0(x)$ and $g(x)$ are assumed to be continuous on $[a, b]$.
The domain D for this problem is

$$D = \{(x, y) \mid a \le x \le b,\ -\infty < y < \infty\}$$

The exact solution of this equation can be found in any elementary differential
equations textbook [e.g., Boyce and DiPrima (1986), p. 13]. As a special case,
consider

$$y' = \lambda y + g(x), \qquad 0 \le x < \infty \tag{6.0.2}$$

with $g(x)$ continuous on $[0, \infty)$. The solution that satisfies $Y(0) = Y_0$ is given by

$$Y(x) = Y_0 e^{\lambda x} + \int_0^x e^{\lambda (x - t)} g(t)\,dt, \qquad 0 \le x < \infty \tag{6.0.3}$$

This is used later to illustrate various theoretical results.
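
Formula (6.0.3) can be checked numerically. The following Python sketch evaluates the integral with a composite Simpson rule and compares the result with a closed form. The test case ($\lambda = -1$, $g(x) \equiv 1$, $Y_0 = 2$, for which $Y(x) = 1 + e^{-x}$), the function names, and the number of subintervals are illustrative assumptions and do not come from the text.

```python
import math

def simpson(func, a, b, n=200):
    """Composite Simpson rule on [a, b] with n subintervals (n made even)."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = func(a) + func(b)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * func(a + j * h)
    return total * h / 3

def solution_603(x, lam, g, Y0):
    """Evaluate (6.0.3): Y0*exp(lam*x) + integral_0^x exp(lam*(x - t)) g(t) dt."""
    integral = simpson(lambda t: math.exp(lam * (x - t)) * g(t), 0.0, x)
    return Y0 * math.exp(lam * x) + integral

# Assumed test case: y' = -y + 1, y(0) = 2, with solution 1 + exp(-x).
x = 1.5
approx = solution_603(x, -1.0, lambda t: 1.0, 2.0)
exact = 1.0 + math.exp(-x)
print(approx, exact)
```

The two printed values should agree to many digits, since the integrand is smooth and Simpson's rule converges rapidly for it.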


2. The equation $y' = -y^2$ is nonlinear. One of its solutions is $Y(x) \equiv 0$, and
the remaining solutions have the form

$$Y(x) = \frac{1}{x + c}$$

with c an arbitrary constant. Note that $|Y(-c)| = \infty$. Thus the global smoothness
of $f(x, y) = -y^2$ does not guarantee a similar behavior in the solutions.


To obtain some geometric insight for the solutions of a single first-order
differential equation, we can look at the direction field induced by the equation
on the xy-plane. If $Y(x)$ is a solution that passes through $(x_0, Y_0)$, then the slope
of $Y(x)$ at $(x_0, Y_0)$ is $Y'(x_0) = f(x_0, Y_0)$. Within the domain D of $f(x, y)$, pick
a representative set of points (x, y) and then draw a short line segment with
slope $f(x, y)$ through each (x, y).


Example Consider the equation $y' = -y$. The direction field is given in Figure
6.1, and several representative solutions have been drawn in. It is clear from the
graph that all solutions $Y(x)$ satisfy

$$\lim_{x \to \infty} Y(x) = 0$$

Figure 6.1 Direction field for $y' = -y$.

To make it easier to draw the direction field of $y' = f(x, y)$, look for those
curves in the xy-plane along which $f(x, y)$ is constant. Solve

$$f(x, y) = c$$

for various choices of c. Then each solution $Y(x)$ of $y' = f(x, y)$ that intersects
the curve $f(x, y) = c$ satisfies $Y'(x) = c$ at the point of intersection. The curves
$f(x, y) = c$ are called level curves of the differential equation.

Example Sketch the qualitative behavior of solutions of

$$y' = x^2 + y^2$$

The level curves of the equation are given by

$$x^2 + y^2 = c, \qquad c > 0$$

which are circles with center (0, 0) and radius $\sqrt{c}$. The direction field with some
representative solutions is given in Figure 6.2. Three circles are drawn, with
$c = \tfrac{1}{4}, 1, 4$. This gives some qualitative information about the solutions $Y(x)$, and
occasionally this can be useful.

Figure 6.2 Direction field for $y' = x^2 + y^2$.

6.1 Existence, Uniqueness, and Stability Theory 
From the examples of the introduction it should be intuitive that in most cases
the initial value problem (6.0.1) has a unique solution. Before beginning the
numerical analysis for (6.0.1), we present some theoretical results for it. Conditions
are given to ensure that it has a unique solution, and we consider the
stability of the solution when the initial data $Y_0$ and the derivative $f(x, y)$ are
changed by small amounts. These results are necessary in order to better
understand the numerical methods presented later. The domain D will henceforth
be assumed to satisfy the following minor technical requirement: If two
points $(x, y_1)$ and $(x, y_2)$ both belong to D, then the vertical line segment joining
them is also contained in D.


Theorem 6.1 Let $f(x, y)$ be a continuous function of x and y, for all (x, y) in
D, and let $(x_0, Y_0)$ be an interior point of D. Assume $f(x, y)$
satisfies the Lipschitz condition

$$|f(x, y_1) - f(x, y_2)| \le K\,|y_1 - y_2| \qquad \text{all } (x, y_1), (x, y_2) \text{ in } D \tag{6.1.1}$$

for some $K \ge 0$. Then for a suitably chosen interval
$I = [x_0 - \alpha, x_0 + \alpha]$, there is a unique solution $Y(x)$ on I of (6.0.1).

Proof The proof of this result can be found in most texts on the theory of
ordinary differential equations [e.g., see Boyce and DiPrima (1986),
p. 95]. For that reason, we omit it. ∎


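Showing the Lipschitz condition is straightforward if $\partial f(x, y)/\partial y$ exists and is
bounded on D. Simply let

$$K = \operatorname*{Max}_{(x, y) \in D} \left| \frac{\partial f(x, y)}{\partial y} \right| \tag{6.1.2}$$

Then using the mean value theorem

$$f(x, y_1) - f(x, y_2) = \frac{\partial f(x, \zeta)}{\partial y}\,(y_1 - y_2)$$

for some $\zeta$ between $y_1$ and $y_2$. Result (6.1.1) follows immediately using (6.1.2).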


Example The following two examples illustrate the theorem.

1. Consider $y' = 1 + \sin(xy)$ with

$$D = \{(x, y) \mid 0 \le x \le 1,\ -\infty < y < \infty\}$$

To compute the Lipschitz constant K, use (6.1.2). Then

$$\frac{\partial f(x, y)}{\partial y} = x \cos(xy), \qquad K = 1$$

Thus for any $(x_0, Y_0)$ with $0 < x_0 < 1$, there is a solution $Y(x)$ to the associated
initial value problem on some interval $[x_0 - \alpha, x_0 + \alpha] \subset [0, 1]$.

2. Consider the problem

$$y' = \frac{2x}{a^2}\,y^2, \qquad y(0) = 1$$

for any constant $a > 0$. The solution is

$$Y(x) = \frac{a^2}{a^2 - x^2}, \qquad -a < x < a$$

The solution exists on only a small interval if a is small. To determine the
Lipschitz constant, we must calculate

$$\frac{\partial f(x, y)}{\partial y} = \frac{4xy}{a^2}$$

To have a finite Lipschitz constant on D, the region D must be bounded in x
and y, say $-c \le x \le c$ and $-b \le y \le b$. The preceding theorem then states that
there is a solution $Y(x)$ on some interval $-\alpha \le x \le \alpha$, with $\alpha \le c$.

Lest it be thought that we can evaluate the partial derivative at the initial
point $(x_0, Y_0) = (0, 1)$ and obtain sufficient information to estimate the Lipschitz
constant K, note that

$$\frac{\partial f(0, 1)}{\partial y} = 0$$

for any $a > 0$.
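
The Lipschitz constant of the first example can be sanity-checked by sampling. The sketch below evaluates $|\partial f/\partial y| = |x\cos(xy)|$ for $f(x, y) = 1 + \sin(xy)$ on a grid and spot-checks the Lipschitz inequality with $K = 1$. The grid, the truncation of y to $[-10, 10]$, and the sample pairs are assumptions made for illustration only; sampling suggests a bound but does not prove one.

```python
import math

def f(x, y):
    """Right-hand side of y' = 1 + sin(x*y)."""
    return 1.0 + math.sin(x * y)

def dfdy(x, y):
    """Partial derivative of f with respect to y."""
    return x * math.cos(x * y)

# Sample |df/dy| for 0 <= x <= 1, with y truncated to [-10, 10] (an assumption).
K_sampled = max(
    abs(dfdy(i / 50, -10 + j / 5))
    for i in range(51)
    for j in range(101)
)

# Spot-check |f(x, y1) - f(x, y2)| <= K |y1 - y2| with K = 1 on a few pairs.
pairs = [(0.3, -2.0, 5.0), (1.0, 0.1, 0.2), (0.7, -8.0, 8.0)]
ok = all(abs(f(x, y1) - f(x, y2)) <= abs(y1 - y2) + 1e-15
         for x, y1, y2 in pairs)
print(K_sampled, ok)
```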

Stability of the solution The stability of the solution $Y(x)$ is examined when the
initial value problem is changed by a small amount. This is related to the
discussion of stability in Section 1.6 of Chapter 1. We consider the perturbed
problem

$$\begin{aligned} y' &= f(x, y) + \delta(x) \\ y(x_0) &= Y_0 + \epsilon \end{aligned} \tag{6.1.3}$$

with the same hypotheses for $f(x, y)$ as in Theorem 6.1. Furthermore, we assume
$\delta(x)$ is continuous for all x such that $(x, y) \in D$ for some y. The problem
(6.1.3) can then be shown to have a unique solution, denoted by $Y(x; \delta, \epsilon)$.


Theorem 6.2 Assume the same hypotheses as in Theorem 6.1. Then the problem
(6.1.3) will have a unique solution $Y(x; \delta, \epsilon)$ on an interval
$[x_0 - \alpha, x_0 + \alpha]$, some $\alpha > 0$, uniformly for all perturbations $\epsilon$
and $\delta(x)$ that satisfy

$$|\epsilon| \le \epsilon_0, \qquad \|\delta\|_\infty \le \epsilon_0$$

for $\epsilon_0$ sufficiently small. In addition, if $Y(x)$ is the solution of the
unperturbed problem, then

$$\operatorname*{Max}_{|x - x_0| \le \alpha} |Y(x) - Y(x; \delta, \epsilon)| \le k\,[\,|\epsilon| + \alpha \|\delta\|_\infty\,] \tag{6.1.4}$$

with $k = 1/(1 - \alpha K)$, using the Lipschitz constant K of (6.1.1).

Proof The derivation of (6.1.4) is much the same as the proof of Theorem 6.1,
and it can be found in most graduate texts on ordinary differential
equations. ∎


Using this result, we can say that the initial value problem (6.0.1) is well-posed
or stable, in the sense of Section 1.6 in Chapter 1. If small changes are made in
the differential equation or in the initial value, then the solution will also change
by a small amount. The solution Y depends continuously on the data of the
problem, namely the function f and the initial data $Y_0$.

It was pointed out in Section 1.6 that a problem could be stable but
ill-conditioned with respect to numerical computation. This is true with differential
equations, although it does not occur often in practice. To better understand
when this may happen, we estimate the perturbation in Y due to perturbations in
the problem. To simplify our discussion, we consider only perturbations $\epsilon$ in the
initial value $Y_0$; perturbations $\delta(x)$ in the equation enter into the final answer in
much the same way, as indicated in (6.1.4).


Perturbing the initial value $Y_0$ as in (6.1.3), let $Y(x; \epsilon)$ denote the perturbed
solution. Then

$$\begin{aligned} Y'(x; \epsilon) &= f(x, Y(x; \epsilon)), \qquad x_0 - \alpha \le x \le x_0 + \alpha \\ Y(x_0; \epsilon) &= Y_0 + \epsilon \end{aligned} \tag{6.1.5}$$

Subtract the corresponding equations of (6.0.1) for $Y(x)$, and let
$Z(x; \epsilon) = Y(x; \epsilon) - Y(x)$. Then

$$Z'(x; \epsilon) = f(x, Y(x; \epsilon)) - f(x, Y(x)) \approx \frac{\partial f(x, Y(x))}{\partial y}\,Z(x; \epsilon) \tag{6.1.6}$$

and $Z(x_0; \epsilon) = \epsilon$. The approximation (6.1.6) is valid when $Y(x; \epsilon)$ is sufficiently
close to $Y(x)$, which it is for small values of $\epsilon$ and small intervals $[x_0 - \alpha, x_0 + \alpha]$.
We can easily solve the approximate differential equation of (6.1.6), obtaining

$$Z(x; \epsilon) \approx \epsilon \cdot \exp\left[ \int_{x_0}^{x} \frac{\partial f(t, Y(t))}{\partial y}\,dt \right] \tag{6.1.7}$$

If the partial derivative satisfies

$$\frac{\partial f(t, Y(t))}{\partial y} \le 0, \qquad |x_0 - t| \le \alpha \tag{6.1.8}$$

then we have that $Z(x; \epsilon)$ probably remains bounded by $\epsilon$ as x increases. In this
case, we say the initial value problem is well-conditioned.

As an example of the opposite behavior, suppose we consider

$$y' = \lambda y + g(x), \qquad y(0) = Y_0 \tag{6.1.9}$$

with $\lambda > 0$. Then $\partial f/\partial y = \lambda$, and we can calculate exactly

$$Z(x; \epsilon) = \epsilon\,e^{\lambda x}$$

Thus the change in $Y(x)$ becomes increasingly large as x increases.


Example The equation

$$y' = 100y - 101e^{-x}, \qquad y(0) = 1 \tag{6.1.10}$$

has the solution $Y(x) = e^{-x}$. The perturbed problem

$$y' = 100y - 101e^{-x}, \qquad y(0) = 1 + \epsilon$$

has the solution

$$Y(x; \epsilon) = e^{-x} + \epsilon\,e^{100x}$$

which rapidly departs from the true solution. We say (6.1.10) is an ill-conditioned
problem.
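
To see how quickly the perturbed solution of (6.1.10) departs from $e^{-x}$, the following Python sketch tabulates $Z(x; \epsilon) = \epsilon e^{100x}$ at a few points. The perturbation size $\epsilon = 10^{-10}$ and the sample points are assumptions chosen for illustration.

```python
import math

def perturbed_solution(x, eps):
    """Solution of y' = 100y - 101 exp(-x) with y(0) = 1 + eps."""
    return math.exp(-x) + eps * math.exp(100.0 * x)

eps = 1e-10  # assumed size of the perturbation in the initial value
for x in (0.0, 0.1, 0.2, 0.3):
    Y = math.exp(-x)
    Z = perturbed_solution(x, eps) - Y  # equals eps * exp(100 x) up to rounding
    print(f"x = {x:.1f}  Y(x) = {Y:.6f}  Z(x; eps) = {Z:.3e}")
```

Even with a perturbation this small, the difference is already larger than the solution itself well before $x = 0.3$.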


For a problem to be well-conditioned, we want the integral

$$\int_{x_0}^{x} \frac{\partial f(t, Y(t))}{\partial y}\,dt$$

to be bounded from above by zero or a small positive number, as x increases.
Then the perturbation $Z(x; \epsilon)$ will be bounded by some constant times $\epsilon$, with
the constant not too large.

In the case that (6.1.8) is satisfied, but the partial derivative is large in
magnitude, we will have that $Z(x; \epsilon) \to 0$ rapidly as x increases. Such equations
are considered well-conditioned, but they may also be troublesome for most of 
the numerical methods of this chapter. These equations are called stiff differential 
equations, and we will return to them later in Section 6.9. 


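Systems of differential equations The material of this chapter generalizes to a
system of m first-order equations, written

$$\begin{aligned} y_1' &= f_1(x, y_1, \ldots, y_m), & y_1(x_0) &= Y_{1,0} \\ &\;\;\vdots \\ y_m' &= f_m(x, y_1, \ldots, y_m), & y_m(x_0) &= Y_{m,0} \end{aligned} \tag{6.1.11}$$

This is often made to look like a first-order equation by introducing vector
notation. Let

$$\mathbf{y}(x) = \begin{bmatrix} y_1(x) \\ \vdots \\ y_m(x) \end{bmatrix} \qquad \mathbf{f}(x, \mathbf{y}) = \begin{bmatrix} f_1(x, \mathbf{y}) \\ \vdots \\ f_m(x, \mathbf{y}) \end{bmatrix} \qquad \mathbf{Y}_0 = \begin{bmatrix} Y_{1,0} \\ \vdots \\ Y_{m,0} \end{bmatrix} \tag{6.1.12}$$

The system (6.1.11) can then be written as

$$\mathbf{y}' = \mathbf{f}(x, \mathbf{y}), \qquad \mathbf{y}(x_0) = \mathbf{Y}_0 \tag{6.1.13}$$

By introducing vector norms to replace absolute values, virtually all of the
preceding results generalize to this vector initial value problem.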
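For higher order equations, the initial value problem

$$\begin{aligned} y^{(m)} &= f(x, y, y', \ldots, y^{(m-1)}) \\ y(x_0) &= Y_0, \ \ldots, \ y^{(m-1)}(x_0) = Y_0^{(m-1)} \end{aligned} \tag{6.1.14}$$

can be converted to a first-order system. Introduce the new unknown functions

$$y_1 = y, \quad y_2 = y', \quad \ldots, \quad y_m = y^{(m-1)}$$

These functions satisfy the system

$$\begin{aligned} y_1' &= y_2, & y_1(x_0) &= Y_0 \\ y_2' &= y_3, & y_2(x_0) &= Y_0' \\ &\;\;\vdots \\ y_{m-1}' &= y_m \\ y_m' &= f(x, y_1, \ldots, y_m), & y_m(x_0) &= Y_0^{(m-1)} \end{aligned} \tag{6.1.15}$$

Example The linear second-order equation

$$y'' = a_1(x)\,y' + a_0(x)\,y + g(x), \qquad y(x_0) = \alpha, \quad y'(x_0) = \beta$$

becomes

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}' = \begin{bmatrix} 0 & 1 \\ a_0(x) & a_1(x) \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} + \begin{bmatrix} 0 \\ g(x) \end{bmatrix}, \qquad \begin{bmatrix} y_1(x_0) \\ y_2(x_0) \end{bmatrix} = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}$$

In vector form with $A(x)$ denoting the coefficient matrix,

$$\mathbf{y}' = A(x)\,\mathbf{y} + \mathbf{G}(x), \qquad \mathbf{y}(x_0) = \mathbf{Y}_0 = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}$$

a linear system of first-order differential equations.

There are special numerical methods for mth order equations, but these have
been developed to a significant extent only for m = 2, which arises in applications
of Newtonian mechanics from Newton's second law of mechanics. Most
high-order equations are solved by first converting them to an equivalent
first-order system, as was just described.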
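
The conversion of the linear second-order equation to a first-order system can be sketched in code. The helper below returns the right-hand side of the system for $\mathbf{y} = [y_1, y_2] = [y, y']$. The function name `second_order_to_system` and the test equation $y'' = -y$ are hypothetical choices for illustration, not part of the text.

```python
def second_order_to_system(a1, a0, g):
    """Return F(x, y) for y'' = a1(x) y' + a0(x) y + g(x), where y = [y1, y2] = [y, y']."""
    def F(x, y):
        y1, y2 = y
        # First component: y1' = y2.  Second component: the original equation.
        return [y2, a0(x) * y1 + a1(x) * y2 + g(x)]
    return F

# Assumed example: y'' = -y, that is, a1 = 0, a0 = -1, g = 0.
F = second_order_to_system(lambda x: 0.0, lambda x: -1.0, lambda x: 0.0)
rhs = F(0.0, [1.0, 0.0])  # right-hand side at y(0) = 1, y'(0) = 0
print(rhs)
```

Any numerical method written for the vector problem (6.1.13) can then be applied to `F` directly.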


6.2 Euler’s Method 


The most popular numerical methods for solving (6.0.1) are called finite
difference methods. Approximate values are obtained for the solution at a set of
grid points

$$x_0 < x_1 < x_2 < \cdots < x_n < \cdots \tag{6.2.1}$$

and the approximate value at each $x_n$ is obtained by using some of the values
obtained in previous steps. We begin with a simple but computationally inefficient
method attributed to Leonhard Euler. The analysis of it has many of the
features of the analyses of the more efficient finite difference methods, but
without their additional complexity. First we give several derivations of Euler's
method, and follow with a complete convergence and stability analysis for it. We
give an asymptotic error formula, and conclude the section by generalizing the
earlier results to systems of equations.

As before, $Y(x)$ will denote the true solution to (6.0.1):

$$Y'(x) = f(x, Y(x)), \qquad Y(x_0) = Y_0 \tag{6.2.2}$$

The approximate solution will be denoted by $y(x)$, and the values $y(x_0)$,
$y(x_1), \ldots, y(x_n), \ldots$ will often be denoted by $y_0, y_1, \ldots, y_n, \ldots$. An equal grid
size $h > 0$ will be used to define the node points,

$$x_j = x_0 + jh, \qquad j = 0, 1, \ldots$$

When we are comparing numerical solutions for various values of h, we will also
use the notation $y_h(x)$ to refer to $y(x)$ with stepsize h. The problem (6.0.1) will
be solved on a fixed finite interval, which will always be denoted by $[x_0, b]$. The
notation $N(h)$ will denote the largest index N for which

$$x_N \le b, \qquad x_{N+1} > b$$

In later sections, we discuss varying the stepsize at each $x_n$, in order to control
the error.


Derivation of Euler's method Euler's method is defined by

$$y_{n+1} = y_n + h f(x_n, y_n), \qquad n = 0, 1, 2, \ldots \tag{6.2.3}$$

with $y_0 = Y_0$. Four viewpoints of it are given.

1. A geometric viewpoint. Consider the graph of the solution $Y(x)$ in Figure
6.3. Form the tangent line to the graph of $Y(x)$ at $x_0$, and use this line as an
approximation to the curve for $x_0 \le x \le x_1$. Then

$$\frac{\Delta y}{h} = Y'(x_0) = f(x_0, Y_0)$$

$$Y(x_1) - Y(x_0) \approx \Delta y = h Y'(x_0)$$

$$Y(x_1) \approx Y(x_0) + h f(x_0, Y(x_0))$$

By repeating this argument on $[x_1, x_2], [x_2, x_3], \ldots$, we obtain the general
formula (6.2.3).

Figure 6.3 Geometric interpretation of Euler's method.

2. Taylor series. Expand $Y(x_{n+1})$ about $x_n$,

$$Y(x_{n+1}) = Y(x_n) + h Y'(x_n) + \frac{h^2}{2} Y''(\xi_n), \qquad x_n \le \xi_n \le x_{n+1} \tag{6.2.4}$$

By dropping the error term, we obtain the Euler method (6.2.3). The term

$$T_n = \frac{h^2}{2} Y''(\xi_n) \tag{6.2.5}$$

is called the truncation error or discretization error at $x_{n+1}$. We use the
former name in this text.

3. Numerical differentiation. From the definition of a derivative,

$$\frac{Y(x_{n+1}) - Y(x_n)}{h} \approx Y'(x_n) = f(x_n, Y(x_n))$$

$$Y(x_{n+1}) \approx Y(x_n) + h f(x_n, Y(x_n))$$

4. Numerical integration. Integrate $Y'(t) = f(t, Y(t))$ over $[x_n, x_{n+1}]$:

$$Y(x_{n+1}) = Y(x_n) + \int_{x_n}^{x_{n+1}} f(t, Y(t))\,dt \tag{6.2.6}$$

Consider the simple numerical integration method

$$\int_a^{a+h} g(t)\,dt \approx h\,g(a) \tag{6.2.7}$$

called the left-hand rectangular rule. Figure 6.4 shows the numerical integral
as the crosshatched area. Applying this to (6.2.6), we obtain

$$Y(x_{n+1}) \approx Y(x_n) + h f(x_n, Y(x_n))$$

as before.

Figure 6.4 Illustration of (6.2.7).


Of the three analytical derivations (2)-(4), both (2) and (4) are the simplest 
cases of a set of increasingly accurate methods. Approach (2) leads to the 
single-step methods, particularly the Runge-Kutta formulas. Approach (4) leads 
to multistep methods, especially the predictor—corrector methods. Perhaps surpris- 
ingly, method (3) often does not lead to other successful methods. The first 
example given of (3) is the midpoint method in Section 6.4, and it leads to 
problems of numerical instability. In contrast in Section 6.9, numerical differenti- 
ation is used to derive a class of methods for solving stiff differential equations. 
The bulk of this chapter is devoted to multistep methods, partly because they are 
generally the most efficient class of methods and partly because they are more 
complex to analyze than are the Runge-Kutta methods. The latter are taken up 
in Section 6.10. 

Before analyzing the Euler method, we give some numerical examples. They
also serve as illustrations for some of the theory that is presented.

Example 1. Consider the equation $y' = y$, $y(0) = 1$. Its true solution is
$Y(x) = e^x$. Numerical results are given in Table 6.1 for several values of h. The
answers $y_h(x_n)$ are given at only a few points, rather than at all points at which
they were calculated. Note that the error at each point x decreases by about half
when h is halved.

Table 6.1 Euler's method, example (1)

    h        x       y_h(x)     Y(x)       Y(x) - y_h(x)
    .2       .40     1.44000    1.49182     .05182
             .80     2.07360    2.22554     .15194
            1.20     2.98598    3.32012     .33413
            1.60     4.29982    4.95303     .65321
            2.00     6.19174    7.38906    1.19732
    .1       .40     1.46410    1.49182     .02772
             .80     2.14359    2.22554     .08195
            1.20     3.13843    3.32012     .18169
            1.60     4.59497    4.95303     .35806
            2.00     6.72750    7.38906     .66156
    .05      .40     1.47746    1.49182     .01437
             .80     2.18287    2.22554     .04267
            1.20     3.22510    3.32012     .09502
            1.60     4.76494    4.95303     .18809
            2.00     7.03999    7.38906     .34907
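
The computation behind Table 6.1 is short enough to sketch directly. The Python function below implements the recursion (6.2.3) and applies it to $y' = y$, $y(0) = 1$ on $[0, 2]$ for the three stepsizes of the table. The function name `euler` and its argument list are assumptions made for this illustration; the text itself does not prescribe an implementation.

```python
import math

def euler(f, x0, y0, h, n_steps):
    """Euler's method (6.2.3): return the nodes x_n and the values y_n."""
    xs, ys = [x0], [y0]
    for n in range(n_steps):
        ys.append(ys[-1] + h * f(xs[-1], ys[-1]))
        xs.append(x0 + (n + 1) * h)  # recompute the node to avoid accumulated rounding
    return xs, ys

f = lambda x, y: y  # the equation y' = y
for h, steps in ((0.2, 10), (0.1, 20), (0.05, 40)):
    xs, ys = euler(f, 0.0, 1.0, h, steps)
    err = math.exp(xs[-1]) - ys[-1]
    print(f"h = {h:<5} y_h(2) = {ys[-1]:.5f}  error = {err:.5f}")
```

For this equation the recursion reduces to $y_n = (1 + h)^n$, so the value at $x = 2$ with $h = .2$ is $1.2^{10}$, in agreement with the table.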


2. Consider the equation

$$y' = \frac{1}{1 + x^2} - 2y^2, \qquad y(0) = 0 \tag{6.2.8}$$

The true solution is

$$Y(x) = \frac{x}{1 + x^2}$$

and the results are given in Table 6.2. Again note the behavior of the error as h is
decreased.

Table 6.2 Euler's method, example (2)

    h        x       y_h(x)     Y(x)       Y(x) - y_h(x)
    .2      0.00     0.0        0.0         0.0
             .40     .37631     .34483     -.03148
             .80     .54228     .48780     -.05448
            1.20     .52709     .49180     -.03529
            1.60     .46632     .44944     -.01689
            2.00     .40682     .40000     -.00682
    .1       .40     .36085     .34483     -.01603
             .80     .51371     .48780     -.02590
            1.20     .50961     .49180     -.01781
            1.60     .45872     .44944     -.00928
            2.00     .40419     .40000     -.00419
    .05      .40     .35287     .34483     -.00804
             .80     .50049     .48780     -.01268
            1.20     .50073     .49180     -.00892
            1.60     .45425     .44944     -.00481
            2.00     .40227     .40000     -.00227

Convergence analysis At each step of the Euler method, an additional truncation
error (6.2.5) is introduced. We analyze the cumulative effect of these errors.
The total error $Y(x) - y(x)$ is called the global error, and the last columns of
Tables 6.1 and 6.2 are examples of global error.

To obtain some intuition about the behavior of the global error for Euler's
method, we consider the very simple problem

$$y' = 2x, \qquad y(0) = 0 \tag{6.2.9}$$


Its solution is $Y(x) = x^2$. Euler's method for this problem is

$$y_{n+1} = y_n + 2h x_n, \qquad y_0 = 0$$

Then it can be verified by induction that

$$y_n = x_{n-1} x_n, \qquad n \ge 1$$

For the error,

$$Y(x_n) - y_n = x_n^2 - x_n x_{n-1} = h x_n \tag{6.2.10}$$

Thus the global error at each fixed value of x is proportional to h. This agrees
with the behavior of the examples in Tables 6.1 and 6.2, in which the error
decreases by about half when h is halved.
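
The closed form for the model problem (6.2.9) is easy to confirm numerically. The following sketch runs the Euler recursion $y_{n+1} = y_n + 2h x_n$ and compares the error $x_n^2 - y_n$ with $h x_n$. The stepsize $h = 0.1$ and the function name are assumptions chosen for illustration.

```python
def euler_model_problem(h, n_steps):
    """Euler's method for y' = 2x, y(0) = 0; returns the list y_0, ..., y_N."""
    ys = [0.0]
    for n in range(n_steps):
        x_n = n * h
        ys.append(ys[-1] + 2.0 * h * x_n)
    return ys

h, n_steps = 0.1, 20  # assumed stepsize and number of steps
ys = euler_model_problem(h, n_steps)
for n in (5, 10, 20):
    x_n = n * h
    error = x_n**2 - ys[n]
    print(f"x_n = {x_n:.1f}  y_n = {ys[n]:.4f}  error = {error:.4f}  h*x_n = {h * x_n:.4f}")
```

The last two printed columns should coincide up to rounding, which is the content of (6.2.10).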

For the complete error analysis, we begin with the following lemma, which is
quite useful in the analysis of finite difference methods.

Lemma 1 For any real x,

$$1 + x \le e^x$$

and for any $x \ge -1$,

$$0 \le (1 + x)^m \le e^{mx} \tag{6.2.11}$$

Proof Using Taylor's theorem,

$$e^x = 1 + x + \frac{x^2}{2}\,e^{\xi}$$

with $\xi$ between 0 and x. Since the remainder is never negative, the first
result is proved. Formula (6.2.11) follows easily. ∎
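
The step from the first inequality of the lemma to (6.2.11) can be written out as follows. This is a short sketch of the omitted argument, assuming (as the later use of the lemma suggests) that m is a nonnegative integer.

```latex
\begin{align*}
x \ge -1 \;&\Longrightarrow\; 0 \le 1 + x \le e^{x}, \\
m \ge 0 \;&\Longrightarrow\; 0 \le (1 + x)^{m} \le \left(e^{x}\right)^{m} = e^{mx}.
\end{align*}
```

The second line uses only that raising to a nonnegative power preserves inequalities between nonnegative numbers.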


For the remainder of this chapter, it is assumed that the function $f(x, y)$
satisfies the following stronger Lipschitz condition:

$$|f(x, y_1) - f(x, y_2)| \le K\,|y_1 - y_2| \qquad -\infty < y_1, y_2 < \infty, \quad x_0 \le x \le b \tag{6.2.12}$$

for some $K \ge 0$. Although stronger than necessary, it will simplify the proofs.
And given a function $f(x, y)$ satisfying the weaker condition (6.1.1) and a
solution $Y(x)$ to the initial value problem (6.0.1), the function f can be modified
to satisfy (6.2.12) without changing the solution $Y(x)$ or the essential character of
the problem (6.0.1) and its numerical solution. [See Shampine and Gordon
(1975), p. 24, for the details.]


Theorem 6.3 Assume that the solution $Y(x)$ of (6.0.1) has a bounded second
derivative on $[x_0, b]$. Then the solution $\{ y_h(x_n) \mid x_0 \le x_n \le b \}$
obtained by Euler's method (6.2.3) satisfies

$$\operatorname*{Max}_{x_0 \le x_n \le b} |Y(x_n) - y_h(x_n)| \le e^{(b - x_0)K}\,|e_0| + \left[ \frac{e^{(b - x_0)K} - 1}{K} \right] \tau(h) \tag{6.2.13}$$

where

$$\tau(h) = \frac{h}{2}\,\|Y''\|_\infty \tag{6.2.14}$$

and $e_0 = Y_0 - y_h(x_0)$.
If in addition,

$$|Y_0 - y_h(x_0)| \le c_1 h \qquad \text{as } h \to 0 \tag{6.2.15}$$

for some $c_1 \ge 0$ (e.g., if $Y_0 = y_0$ for all h), then there is a constant
$B \ge 0$ for which

$$\operatorname*{Max}_{x_0 \le x_n \le b} |Y(x_n) - y_h(x_n)| \le Bh \tag{6.2.16}$$


Proof Let $e_n = Y(x_n) - y(x_n)$, $n \ge 0$, and recall the definition of $N(h)$ from
the beginning of the section. Define

$$\tau_n = \frac{h}{2}\,Y''(\xi_n), \qquad 0 \le n \le N(h) - 1$$

based on the truncation error in (6.2.5). Easily

$$\operatorname*{Max}_{0 \le n \le N - 1} |\tau_n| \le \tau(h)$$

using (6.2.14). From (6.2.4), (6.2.2), and (6.2.3), we have

$$Y_{n+1} = Y_n + h f(x_n, Y_n) + h \tau_n \tag{6.2.17}$$

$$y_{n+1} = y_n + h f(x_n, y_n), \qquad 0 \le n \le N(h) - 1 \tag{6.2.18}$$

We are using the common notation $Y_n = Y(x_n)$. Subtracting (6.2.18)
from (6.2.17),

$$e_{n+1} = e_n + h\,[f(x_n, Y_n) - f(x_n, y_n)] + h \tau_n \tag{6.2.19}$$

and taking bounds, using (6.2.12),

$$|e_{n+1}| \le |e_n| + hK\,|Y_n - y_n| + h\,|\tau_n|$$

$$|e_{n+1}| \le (1 + hK)\,|e_n| + h \tau(h), \qquad 0 \le n \le N(h) - 1 \tag{6.2.20}$$

Apply (6.2.20) recursively to obtain

$$|e_n| \le (1 + hK)^n |e_0| + \{ 1 + (1 + hK) + \cdots + (1 + hK)^{n-1} \}\, h \tau(h)$$

Using the formula for the sum of a finite geometric series,

$$1 + r + r^2 + \cdots + r^{n-1} = \frac{r^n - 1}{r - 1}, \qquad r \ne 1 \tag{6.2.21}$$

we obtain

$$|e_n| \le (1 + hK)^n |e_0| + \left[ \frac{(1 + hK)^n - 1}{K} \right] \tau(h) \tag{6.2.22}$$

Using Lemma 1,

$$(1 + hK)^n \le e^{nhK} = e^{(x_n - x_0)K} \le e^{(b - x_0)K}$$

and this with (6.2.22) implies the main result (6.2.13).
The remaining result (6.2.16) is a trivial corollary of (6.2.13) with the
constant B given by

$$B = c_1 e^{(b - x_0)K} + \frac{1}{2} \left[ \frac{e^{(b - x_0)K} - 1}{K} \right] \|Y''\|_\infty$$

This completes the proof. ∎

The result (6.2.16) implies that the error should decrease by at least one-half
when the stepsize h is halved. This is confirmed in the examples of Tables 6.1 and
6.2. It is shown later in the section that (6.2.16) gives exactly the correct rate of
convergence (also see Problem 7).
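
To get a feeling for how the bound (6.2.13) compares with the actual error, the sketch below evaluates both for $y' = y$, $y(0) = 1$ on $[0, 2]$. It assumes $e_0 = 0$, takes $K = 1$ (since $\partial f/\partial y = 1$), and uses $\|Y''\|_\infty = e^2$ on $[0, 2]$; the function names are hypothetical.

```python
import math

def euler_final_error(h, b=2.0):
    """Global error Y(b) - y_h(b) of Euler's method for y' = y, y(0) = 1."""
    n_steps = round(b / h)
    y = 1.0
    for _ in range(n_steps):
        y += h * y
    return math.exp(b) - y

def bound_6_2_13(h, b=2.0, K=1.0):
    """Right side of (6.2.13) with e_0 = 0 and ||Y''|| = exp(b) on [0, b]."""
    tau = 0.5 * h * math.exp(b)  # tau(h) from (6.2.14)
    return (math.exp(b * K) - 1.0) / K * tau

for h in (0.2, 0.1, 0.05):
    print(f"h = {h:<5} error = {euler_final_error(h):.5f}  bound = {bound_6_2_13(h):.5f}")
```

Both columns shrink roughly in proportion to h, with the bound several times larger than the observed error.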

The bound (6.2.16) gives the correct speed of convergence for Euler’s method, 
but the multiplying constant B is much too large for most equations. For 
example, with the earlier example (6.2.8), the formula (6.2.13) will predict that 
the bound grows with b. But clearly from Table 6.2, the error decreases with 
increasing x. We give the following improvement of (6.2.13) to handle many 
cases such as (6.2.8). 


Corollary. Assume the same hypotheses as in Theorem 6.3; in addition, assume
that

$$\frac{\partial f(x, y)}{\partial y} \le 0 \tag{6.2.23}$$

for $x_0 \le x \le b$, $-\infty < y < \infty$. Then for all sufficiently small h,

$$|Y(x_n) - y_h(x_n)| \le |e_0| + \frac{h}{2}\,(x_n - x_0) \operatorname*{Max}_{x_0 \le x \le b} |Y''(x)| \tag{6.2.24}$$

for $x_0 \le x_n \le b$.

Proof Apply the mean value theorem to the error equation (6.2.19):

$$e_{n+1} = \left[ 1 + h\,\frac{\partial f(x_n, \zeta_n)}{\partial y} \right] e_n + \frac{h^2}{2}\,Y''(\xi_n) \tag{6.2.25}$$

with $\zeta_n$ between $y_h(x_n)$ and $Y(x_n)$. From the convergence of $y_h(x)$ to
$Y(x)$ on $[x_0, b]$, we know that the partial derivatives $\partial f(x_n, \zeta_n)/\partial y$
approach $\partial f(x, Y(x))/\partial y$, and thus they must be bounded in magnitude
over $[x_0, b]$. Pick $h_0 > 0$ so that

$$1 + h\,\frac{\partial f(x_n, \zeta_n)}{\partial y} \ge -1, \qquad x_0 \le x_n \le b \tag{6.2.26}$$

for all $h \le h_0$. From (6.2.23), we know the left side is also bounded
above by 1, for all h. Apply these results to (6.2.25) to get

$$|e_{n+1}| \le |e_n| + \frac{h^2}{2}\,|Y''(\xi_n)| \tag{6.2.27}$$

By induction, we can show

$$|e_n| \le |e_0| + \frac{h^2}{2} \left[ |Y''(\xi_0)| + \cdots + |Y''(\xi_{n-1})| \right]$$

which easily leads to (6.2.24). ∎

Result (6.2.24) is a considerable improvement over the earlier bound (6.2.13);
the exponential $\exp(K(b - x_0))$ is replaced by $b - x_0$ (bounding $x_n - x_0$),
which increases less rapidly with b. The theorem does not apply directly to the
earlier example (6.2.8), but a careful examination of the proof in this case will
show that the proof is still valid.


Stability analysis Recall the stability analysis for the initial value problem, 
given in Theorem 6.2. To consider a similar idea for Euler’s method, we consider 
the numerical method 


Zna1 =Z_, th[f(x,,2,) + O(x,)] Os<n<N(h)-1 (6.2.28) 


with Zz) = Yo + €. This is in analogue to the comparison of (6.1.5) with (6.0.1), 
showing the stability of the initial value problem. We compare the two numerical 
solutions {z,} and { y,} as h > 0. 

Let e, = Z, — Yq. 2 = 0. Then eg = €, and subtracting (6.2.3) from (6.2.28), 


€ns1 = On + ALS (ms Zn) ~ f(Xn» Ind] + 48(x,) 


This has exactly the same form as (6.2.19). Using the same procedure as that 
following (6.2.19), we have 


. e(o-x)K _ 4 
Max |z,— y,| < e@-*0*je] + | ~—_———_ 1 
9 Max len ~ Jal SEO el eo isis 


Consequently, there are constants k,, k, independent of A, with 


Max |Z, —Yal < kilel + KallSll., (6.2.29) 


O<n<N(h) 


This is the analogue to the result (6.1.4) for the original problem (6.0.1). This says 
that Euler’s method is a stable numerical method for the solution of the initial 
value problem (6.0.1). We insist that all numerical methods for initial value 
problems possess this form of stability, imitating the stability of the original 
problem (6.0.1). In addition, we require other forms of stability as well, which are introduced later. In the future we take $\delta(x) \equiv 0$ and consider only the effect of perturbing the initial value $Y_0$. This simplifies the analysis, and the results are equally useful.
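The stability bound can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes the test problem $y' = -y + 2\cos x$ (Lipschitz constant $K = 1$) on $[0, 5]$, and the perturbation sizes `eps`, `dmax` and the perturbing function $\delta(x) = 10^{-3}\sin 3x$ are arbitrary illustrative choices.

```python
import math

def euler(f, x0, y0, b, h, delta=lambda x: 0.0):
    """Euler's method with an optional per-step perturbation delta(x), as in (6.2.28)."""
    n_steps = round((b - x0) / h)
    ys = [y0]
    for n in range(n_steps):
        x = x0 + n * h
        ys.append(ys[-1] + h * (f(x, ys[-1]) + delta(x)))
    return ys

f = lambda x, y: -y + 2.0 * math.cos(x)    # assumed test problem, K = 1
x0, b, K = 0.0, 5.0, 1.0
eps, dmax = 1e-3, 1e-3                     # hypothetical perturbation sizes

def perturbation_gap(h):
    """max_n |z_n - y_n| for the perturbed and unperturbed Euler solutions."""
    y = euler(f, x0, 1.0, b, h)
    z = euler(f, x0, 1.0 + eps, b, h, delta=lambda x: dmax * math.sin(3 * x))
    return max(abs(zn - yn) for zn, yn in zip(z, y))

growth = math.exp((b - x0) * K)
bound = growth * eps + (growth - 1.0) / K * dmax   # bound preceding (6.2.29)
gaps = [perturbation_gap(h) for h in (0.1, 0.05, 0.025)]
```

The values in `gaps` stay below `bound` for every stepsize, which is the content of (6.2.29): the constants do not depend on $h$.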


Rounding error analysis Introduce an error into each step of the Euler method, 
with each error derived from the rounding errors of the operations being 
performed. This number, denoted by $\rho_n$, is called the local rounding error. Calling the resultant numerical values $\tilde{y}_n$, we have

$$ \tilde{y}_{n+1} = \tilde{y}_n + h f(x_n, \tilde{y}_n) + \rho_n, \qquad n = 0, 1, \ldots, N(h) - 1 \tag{6.2.30} $$

The values $\tilde{y}_n$ are the finite-place numbers actually obtained in the computer, and $y_n$ is the value that would be obtained if exact arithmetic were being used. Let $\rho(h)$ be a bound on the rounding errors,

$$ \rho(h) = \max_{0 \le n \le N(h)-1} |\rho_n| \tag{6.2.31} $$

In a practical situation, using a fixed word length computer, the bound $\rho(h)$ does not decrease as $h \to 0$. Instead, it remains approximately constant, and $\rho(h)/\|Y\|_\infty$ will be proportional to the unit roundoff $u$ on the computer, that is, the smallest number $u$ for which $1 + u > 1$ [see (1.2.12) in Chapter 1].

To see the effect of the roundoff errors in (6.2.30), subtract it from the true 
solution in (6.2.4), to obtain 


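$$ \tilde{e}_{n+1} = \tilde{e}_n + h\,[f(x_n, Y_n) - f(x_n, \tilde{y}_n)] + h\tau_n - \rho_n $$

where $\tilde{e}_n = Y(x_n) - \tilde{y}_n$. Proceed as in the proof of Theorem 6.3, but identify $\tau_n - \rho_n/h$ with $\tau_n$ in the earlier proof. Then we obtain

$$ |\tilde{e}_n| \le e^{(b-x_0)K}\,|Y_0 - \tilde{y}_0| + \left[\frac{e^{(b-x_0)K} - 1}{K}\right]\left[\tau(h) + \frac{\rho(h)}{h}\right] \tag{6.2.32} $$

To further examine this bound, let $\rho(h)/\|Y\|_\infty \doteq u$, as previously discussed. Further assume that $Y_0 = \tilde{y}_0$. Then

$$ |\tilde{e}_n| \le c\left\{\frac{h}{2}\,\|Y''\|_\infty + \frac{u\,\|Y\|_\infty}{h}\right\} \equiv E(h) \tag{6.2.33} $$

where $c$ is the bracketed constant $[e^{(b-x_0)K} - 1]/K$ of (6.2.32). The qualitative behavior of $E(h)$ is shown in the graph of Figure 6.5. At $h^*$, $E(h)$ is a minimum, and any further decrease in $h$ will cause an increased error bound $E(h)$.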

This derivation gives the worst possible case for the effect of the rounding errors. In practice the rounding errors vary in both size and sign. The resultant cancellation will cause $\tilde{e}_n$ to increase less rapidly than is implied by the $1/h$ term in $E(h)$ and (6.2.32). But there will still be an optimum value of $h^*$, and below it the error will again increase. For a more complete analysis, see Henrici (1962, pp. 35-59).
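The location of the minimum of $E(h)$ follows from setting $dE/dh = 0$ in (6.2.33), which gives $h^* = \sqrt{2u\|Y\|_\infty / \|Y''\|_\infty}$. The sketch below evaluates this under stated assumptions: the unit roundoff is taken as $2^{-52}$ (IEEE double precision, not the machine discussed in the text), the norms are those of $Y(x) = \sin x + \cos x$, and the function names are illustrative.

```python
import math

def error_bound(h, u, norm_Y, norm_Y2, c=1.0):
    """E(h) from (6.2.33): truncation term plus worst-case rounding term."""
    return c * (0.5 * h * norm_Y2 + u * norm_Y / h)

def optimal_h(u, norm_Y, norm_Y2):
    """Minimizer of E(h): dE/dh = 0 gives h* = sqrt(2 u ||Y|| / ||Y''||)."""
    return math.sqrt(2.0 * u * norm_Y / norm_Y2)

u = 2.0 ** -52                         # assumed unit roundoff (IEEE double)
norm_Y = norm_Y2 = math.sqrt(2.0)      # Y = sin x + cos x: ||Y|| = ||Y''|| = sqrt(2)
h_star = optimal_h(u, norm_Y, norm_Y2)
E_min = error_bound(h_star, u, norm_Y, norm_Y2)
```

Since the rounding term is a worst case, the stepsize at which errors actually begin to grow is usually smaller than `h_star`.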


Figure 6.5  Error curve for (6.2.33).
Usually rounding error is not a serious problem. However, if the desired accuracy is close to the best that can be attained because of the computer word length, then greater attention must be given to the effects due to rounding. On an IBM mainframe computer (see Table 1.1) with double precision arithmetic, if $h \ge .001$, then the maximum of $u/h$ is $2.2 \times 10^{-13}$, where $u$ is the unit round. Thus the rounding error will usually not present a significant problem unless very small error tolerances are desired. But in single precision with the same restriction on $h$, the maximum of $u/h$ is $0.5 \times 10^{-4}$, and with an error tolerance of this magnitude (not an unreasonable one), the rounding error will be a more significant factor.


Example  We solve the problem

$$ y' = -y + 2\cos(x), \qquad y(0) = 1 $$

whose true solution is $Y(x) = \sin(x) + \cos(x)$. We solve it using Euler's method, with three different forms of arithmetic: (1) four-digit decimal floating-point arithmetic with chopping; (2) four-digit decimal floating-point arithmetic with rounding; and (3) exact, or very high-precision, arithmetic. In the first two cases, the unit rounding errors are $u = .001$ and $u = .0005$, respectively. The bound (6.2.32) applies to cases 1 and 2, whereas case 3 satisfies the theoretical bound (6.2.24). The errors for the three forms of Euler's method are given in Table 6.3. The errors for the answers obtained using decimal arithmetic are based on the true answers $Y(x)$ rounded to four digits.

For the case of chopped decimal arithmetic, the errors are beginning to be affected with $h = .02$; with $h = .01$, the chopping error has a significant effect on the total error. In contrast, the errors using rounded arithmetic are continuing to decrease, although the $h = .01$ case is affected slightly. With this problem, as with most others, the use of rounded arithmetic is far superior to that of chopped arithmetic.

Table 6.3  Example of rounding effects in Euler's method

                  Chopped      Rounded      Exact
  h      x        Decimal      Decimal      Arithmetic
 .04     1       -1.00E-2     -1.70E-2     -1.70E-2
         2       -1.17E-2     -1.83E-2     -1.83E-2
         3       -1.20E-3     -2.80E-3     -2.78E-3
         4        1.00E-2      1.60E-2      1.53E-2
         5        1.13E-2      1.96E-2      1.94E-2
 .02     1        7.00E-3     -9.00E-3     -8.46E-3
         2        4.00E-3     -9.10E-3     -9.13E-3
         3        2.30E-3     -1.40E-3     -1.40E-3
         4       -6.00E-3      8.00E-3      7.62E-3
         5       -6.00E-3      8.50E-3      9.63E-3
 .01     1        2.80E-2     -3.00E-3     -4.22E-3
         2        2.28E-2     -4.30E-3     -4.56E-3
         3        7.40E-3     -4.00E-4     -7.03E-4
         4       -2.30E-2      3.00E-3      3.80E-3
         5       -2.41E-2      4.60E-3      4.81E-3
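The contrast between chopping and rounding can be imitated with Python's `decimal` module. This is only a sketch under several assumptions: every arithmetic operation is rounded to four significant digits, $\cos x_n$ is taken from the host machine and then reduced to four digits, and the order of operations is a guess. It is not expected to reproduce the digits of Table 6.3, only the qualitative effect.

```python
import math
from decimal import Context, ROUND_DOWN, ROUND_HALF_UP

def euler_error_4digit(h, x_end, rounding):
    """Error of Euler's method for y' = -y + 2cos(x), y(0) = 1, when each
    operation is reduced to 4 significant decimal digits."""
    ctx = Context(prec=4, rounding=rounding)
    ref = Context(prec=4, rounding=ROUND_HALF_UP)        # for the rounded true value
    hd = ctx.create_decimal(repr(h))
    two, y = ctx.create_decimal(2), ctx.create_decimal(1)
    for n in range(round(x_end / h)):
        c = ctx.create_decimal_from_float(math.cos(n * h))   # host cos, then 4 digits
        slope = ctx.subtract(ctx.multiply(two, c), y)        # f(x_n, y_n) = 2cos(x_n) - y_n
        y = ctx.add(y, ctx.multiply(hd, slope))
    true4 = ref.create_decimal_from_float(math.sin(x_end) + math.cos(x_end))
    return float(true4 - y)

err_chop = euler_error_4digit(0.01, 1.0, ROUND_DOWN)       # chopped arithmetic
err_round = euler_error_4digit(0.01, 1.0, ROUND_HALF_UP)   # rounded arithmetic
```

With chopping, each addition $y_n + h f(x_n, y_n)$ discards a nonnegative amount, so the lost digits accumulate in one direction; with rounding they partly cancel.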


Asymptotic error analysis  An asymptotic estimate of the error in Euler's method is derived, ignoring any effects due to rounding. Before beginning, some special notation is necessary to simplify the algebra in the analysis. If $B(x, h)$ is a function defined for $x_0 \le x \le b$ and for all sufficiently small $h$, then the notation

$$ B(x, h) = O(h^p) $$

for some $p > 0$, means there is a constant $c$ such that

$$ |B(x, h)| \le c\,h^p, \qquad x_0 \le x \le b \tag{6.2.34} $$

for all sufficiently small $h$. If $B$ depends on $h$ only, the same kind of bound is implied.


Theorem 6.4  Assume $Y(x)$ is the solution of the initial value problem (6.0.1) and that it is three times continuously differentiable. Assume

$$ f_y(x, y) = \frac{\partial f(x, y)}{\partial y}, \qquad f_{yy}(x, y) = \frac{\partial^2 f(x, y)}{\partial y^2} $$

are continuous and bounded for $x_0 \le x \le b$, $-\infty < y < \infty$. Let the initial value $y_h(x_0)$ satisfy

$$ Y_0 - y_h(x_0) = \delta_0 h + O(h^2) \tag{6.2.35} $$

Usually this error is zero and thus $\delta_0 = 0$. Then the error in Euler's method (6.2.3) satisfies

$$ Y(x_n) - y_h(x_n) = D(x_n)\,h + O(h^2) \tag{6.2.36} $$

where $D(x)$ is the solution of the linear initial value problem

$$ D'(x) = f_y(x, Y(x))\,D(x) + \tfrac{1}{2}\,Y''(x), \qquad D(x_0) = \delta_0 \tag{6.2.37} $$


Proof  Using Taylor's theorem,

$$ Y(x_{n+1}) = Y(x_n) + hY'(x_n) + \frac{h^2}{2}\,Y''(x_n) + \frac{h^3}{6}\,Y'''(\xi_n) $$

for some $x_n \le \xi_n \le x_{n+1}$. Subtract (6.2.3) and use (6.2.2) to obtain

$$ e_{n+1} = e_n + h\,[f(x_n, Y_n) - f(x_n, y_n)] + \frac{h^2}{2}\,Y''(x_n) + \frac{h^3}{6}\,Y'''(\xi_n) \tag{6.2.38} $$
Using Taylor's theorem on $f(x_n, y_n)$, regarded as a function of $y_n$,

$$ f(x_n, y_n) = f(x_n, Y_n) + (y_n - Y_n)\,f_y(x_n, Y_n) + \tfrac{1}{2}(y_n - Y_n)^2 f_{yy}(x_n, \zeta_n) $$

for some $\zeta_n$ between $y_n$ and $Y_n$. Using this in (6.2.38),

$$ e_{n+1} = [1 + h f_y(x_n, Y_n)]\,e_n + \frac{h^2}{2}\,Y''(x_n) + B_n $$

$$ B_n = \frac{h^3}{6}\,Y'''(\xi_n) - \tfrac{1}{2}\,h\,f_{yy}(x_n, \zeta_n)\,e_n^2 \tag{6.2.39} $$

Using (6.2.16),

$$ B_n = O(h^3) \tag{6.2.40} $$


Because $B_n$ is small relative to the remaining terms in (6.2.39), we find the dominant part of the error by neglecting $B_n$. Let $g_n$ represent the dominant part of the error. It is defined implicitly by

$$ g_{n+1} = [1 + h f_y(x_n, Y_n)]\,g_n + \frac{h^2}{2}\,Y''(x_n) \tag{6.2.41} $$

with

$$ g_0 = \delta_0 h \tag{6.2.42} $$

the dominant part of the initial error (6.2.35). Since we expect $g_n \doteq e_n$, and since $e_n = O(h)$, we introduce the sequence $\{\delta_n\}$ implicitly by

$$ g_n = h\,\delta_n \tag{6.2.43} $$


Substituting into (6.2.41), canceling $h$, and rearranging, we obtain

$$ \delta_{n+1} = \delta_n + h\left[f_y(x_n, Y_n)\,\delta_n + \tfrac{1}{2}\,Y''(x_n)\right], \qquad x_0 \le x_n \le b \tag{6.2.44} $$

The initial value $\delta_0$ from (6.2.35) was defined as independent of $h$. The equation (6.2.44) is Euler's method for solving the initial value problem (6.2.37); thus from Theorem 6.3,

$$ D(x_n) - \delta_n = O(h), \qquad x_0 \le x_n \le b \tag{6.2.45} $$

Combining this with (6.2.43),

$$ g_n = D(x_n)\,h + O(h^2) \tag{6.2.46} $$


To complete the proof, it must be shown that $g_n$ is indeed the principal part of the error $e_n$. Introduce

$$ k_n = e_n - g_n \tag{6.2.47} $$

Then $k_0 = e_0 - g_0 = O(h^2)$ from (6.2.35) and (6.2.42). Subtracting (6.2.41) from (6.2.39), and using (6.2.40),

$$ k_{n+1} = [1 + h f_y(x_n, Y_n)]\,k_n + B_n $$

$$ |k_{n+1}| \le (1 + hK)\,|k_n| + O(h^3) $$

This is of the form of (6.2.20) in the proof of Theorem 6.3, with the term $h\tau(h)$ replaced by $O(h^3)$. Using the same derivation,

$$ |k_n| = O(h^2) \tag{6.2.48} $$

Combining (6.2.46)-(6.2.48),

$$ e_n = g_n + k_n = [hD(x_n) + O(h^2)] + O(h^2) $$

which proves (6.2.36). ■
The function $D(x)$ is rarely produced explicitly, but the form of the error (6.2.36) furnishes useful qualitative information. It is often used as the basis for extrapolation procedures, some of which are discussed in later sections.

Example  Consider the problem

$$ y' = -y, \qquad y(0) = 1 $$

with the solution $Y(x) = e^{-x}$. The equation for $D(x)$ is

$$ D'(x) = -D(x) + \tfrac{1}{2}\,e^{-x}, \qquad D(0) = 0 $$

and its solution is

$$ D(x) = \tfrac{1}{2}\,x e^{-x} $$

This gives the following asymptotic formula for the error in Euler's method:

$$ Y(x_n) - y_h(x_n) \doteq \frac{h}{2}\,x_n e^{-x_n} \tag{6.2.49} $$


Table 6.4 contains the actual errors and the errors predicted by (6.2.49) for $h = .05$. Note that the error decreases with increasing $x$, just as with the solution $Y(x)$. But the relative error increases linearly with $x_n$,

$$ \frac{Y(x_n) - y_h(x_n)}{Y(x_n)} \doteq \frac{h}{2}\,x_n $$

Also, the estimate (6.2.49) is much better than the bound given by (6.2.13) in Theorem 6.3. That bound is

$$ |Y(x_n) - y_h(x_n)| \le \frac{h}{2}\left(e^{x_n} - 1\right) $$

and it increases exponentially with $x_n$.

Table 6.4  Example (6.2.49) of Theorem 6.4

  x_n     Y_n - y_n     hD(x_n)
  0.4     .00689        .00670
  0.8     .00920        .00899
  1.2     .00921        .00904
  1.6     .00819        .00808
  2.0     .00682        .00677
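The comparison in Table 6.4 can be reproduced with a few lines of code. This is a minimal sketch in ordinary double precision; the function name `euler_error` is illustrative.

```python
import math

def euler_error(h, x_end):
    """Return Y(x_end) - y_h(x_end) for y' = -y, y(0) = 1, using Euler's method."""
    y = 1.0
    for _ in range(round(x_end / h)):
        y += h * (-y)
    return math.exp(-x_end) - y

h = 0.05
# Each row: (x_n, actual error, predicted error h*D(x_n) = (h/2) x_n exp(-x_n))
rows = [(x, euler_error(h, x), 0.5 * h * x * math.exp(-x))
        for x in (0.4, 0.8, 1.2, 1.6, 2.0)]
```

The second and third entries of each row agree with the table to about two significant digits, and the gap between them is the $O(h^2)$ term of (6.2.36).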


Systems of equations  To simplify the presentation, we will consider only the following system of order 2:

$$ \begin{aligned} y_1' &= f_1(x, y_1, y_2), & y_1(x_0) &= Y_{1,0} \\ y_2' &= f_2(x, y_1, y_2), & y_2(x_0) &= Y_{2,0} \end{aligned} \tag{6.2.50} $$

The generalization to higher order systems is straightforward. Euler's method for solving (6.2.50) is

$$ \begin{aligned} y_{1,n+1} &= y_{1,n} + h f_1(x_n, y_{1,n}, y_{2,n}) \\ y_{2,n+1} &= y_{2,n} + h f_2(x_n, y_{1,n}, y_{2,n}) \end{aligned} \tag{6.2.51} $$

a clear generalization of (6.2.3).

All of the preceding results of this section will generalize to (6.2.51), and the form of the generalization is clear if we use the vector notation for (6.2.50) and (6.2.51), as given in (6.1.13) in the last section. Use

$$ \mathbf{y}' = \mathbf{f}(x, \mathbf{y}), \qquad \mathbf{y}(x_0) = \mathbf{Y}_0 $$

In place of absolute values, use the norm (1.1.16) from Chapter 1:

$$ \|\mathbf{y}\| = \max_i |y_i| $$

To generalize the Lipschitz condition (6.2.12), use Taylor's theorem (Theorem 1.5) for functions of several variables to obtain

$$ \|\mathbf{f}(x, \mathbf{z}) - \mathbf{f}(x, \mathbf{y})\| \le K\,\|\mathbf{z} - \mathbf{y}\| \tag{6.2.52} $$

$$ K = \max_i \sum_j \max_{\substack{x_0 \le x \le b \\ -\infty < w_1, w_2 < \infty}} \left|\frac{\partial f_i(x, w_1, w_2)}{\partial w_j}\right| \tag{6.2.53} $$
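The role of $\partial f(x, y)/\partial y$ is replaced by the Jacobian matrix

$$ \mathbf{f}_{\mathbf{y}}(x, \mathbf{y}) = \begin{bmatrix} \dfrac{\partial f_1}{\partial y_1} & \dfrac{\partial f_1}{\partial y_2} \\[2mm] \dfrac{\partial f_2}{\partial y_1} & \dfrac{\partial f_2}{\partial y_2} \end{bmatrix} \tag{6.2.54} $$

As an example, the asymptotic error formula (6.2.36) becomes

$$ \mathbf{Y}(x_n) - \mathbf{y}_h(x_n) = h\,\mathbf{D}(x_n) + \mathbf{R}_n, \qquad \|\mathbf{R}_n\| = O(h^2) \tag{6.2.55} $$

with $\mathbf{D}(x)$ the solution of the linear system

$$ \mathbf{D}'(x) = \mathbf{f}_{\mathbf{y}}(x, \mathbf{Y}(x))\,\mathbf{D}(x) + \tfrac{1}{2}\,\mathbf{Y}''(x), \qquad \mathbf{D}(x_0) = \mathbf{0} \tag{6.2.56} $$

using the preceding matrix $\mathbf{f}_{\mathbf{y}}(x, \mathbf{y})$.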


Example  Solve the pendulum equation,

$$ \theta''(t) = -\sin(\theta(t)), \qquad \theta(0) = \frac{\pi}{2}, \qquad \theta'(0) = 0 $$

Convert this to a system by letting $y_1 = \theta$, $y_2 = \theta'$, and replace the variable $t$ by $x$. Then

$$ \begin{aligned} y_1' &= y_2, & y_1(0) &= \frac{\pi}{2} \\ y_2' &= -\sin(y_1), & y_2(0) &= 0 \end{aligned} \tag{6.2.57} $$

The numerical results are given in Table 6.5. Note that the error decreases by about half when $h$ is halved.

Table 6.5  Euler's method for example (6.2.57)

  h     x_n    y_{1,n}    Y_1(x_n)    Error      y_{2,n}     Y_2(x_n)     Error
 0.2    .2     1.5708     1.5508     -.0200     -.20000     -.199999     .000001
        .6     1.4508     1.3910     -.0598     -.59984     -.59806      .00178
       1.0     1.1711     1.0749     -.0962     -.99267     -.97550      .01717
 0.1    .2     1.5608     1.5508     -.0100     -.20000     -.199999     .000001
        .6     1.4208     1.3910     -.0298     -.59927     -.59806      .00121
       1.0     1.1223     1.0749     -.0474     -.98568     -.97550      .01018
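The Euler values in Table 6.5 can be regenerated with a short vector version of the method. The sketch below is illustrative only; the helper names `euler_system` and `pendulum` are not from the text.

```python
import math

def euler_system(f, x0, y0, h, n_steps):
    """Euler's method (6.2.51) for a system y' = f(x, y); y is a list of components."""
    x, y = x0, list(y0)
    for _ in range(n_steps):
        slope = f(x, y)                         # evaluate every f_i at the old point
        y = [yi + h * si for yi, si in zip(y, slope)]
        x += h
    return y

def pendulum(x, y):
    """Right side of (6.2.57): y1' = y2, y2' = -sin(y1)."""
    return [y[1], -math.sin(y[0])]

start = [math.pi / 2, 0.0]
y_h02 = euler_system(pendulum, 0.0, start, 0.2, 3)   # x = 0.6 with h = 0.2
y_h01 = euler_system(pendulum, 0.0, start, 0.1, 6)   # x = 0.6 with h = 0.1
```

Both components must be advanced from the same old values $(y_{1,n}, y_{2,n})$, which is why the slope list is computed before either component is updated.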
6.3 Multistep Methods


This section contains an introduction to the theory of multistep methods. Some 
particular methods are examined in greater detail in the following sections. A 
more complete theory is given in Section 6.8. 

As before, let $h > 0$ and define the nodes by $x_n = x_0 + nh$, $n \ge 0$. The general form of the multistep methods to be considered is

$$ y_{n+1} = \sum_{j=0}^{p} a_j y_{n-j} + h \sum_{j=-1}^{p} b_j f(x_{n-j}, y_{n-j}), \qquad n \ge p \tag{6.3.1} $$

The coefficients $a_0, \ldots, a_p$, $b_{-1}, b_0, \ldots, b_p$ are constants, and $p \ge 0$. If either $a_p \ne 0$ or $b_p \ne 0$, the method is called a $p + 1$ step method, because $p + 1$ previous solution values are being used to compute $y_{n+1}$. The values $y_1, \ldots, y_p$ must be obtained by other means; this is discussed in later sections. Euler's method is an example of a one-step method, with $p = 0$ and

$$ a_0 = 1, \qquad b_0 = 1, \qquad b_{-1} = 0 $$

If $b_{-1} = 0$, then $y_{n+1}$ occurs on only the left side of equation (6.3.1). Such formulas are called explicit methods. If $b_{-1} \ne 0$, then $y_{n+1}$ is present on both sides of (6.3.1) and the formula is called an implicit method. The existence of the solution $y_{n+1}$, for all sufficiently small $h$, can be shown by using the fixed-point theory of Section 2.5. Implicit methods are generally solved by iteration, which is discussed in detail for the trapezoidal method in Section 6.5.


Example  1. The midpoint method is defined by

$$ y_{n+1} = y_{n-1} + 2h f(x_n, y_n), \qquad n \ge 1 \tag{6.3.2} $$

and it is an explicit two-step method. It is examined in Section 6.4.

2. The trapezoidal method is defined by

$$ y_{n+1} = y_n + \frac{h}{2}\left[f(x_n, y_n) + f(x_{n+1}, y_{n+1})\right], \qquad n \ge 0 \tag{6.3.3} $$

It is an implicit one-step method, and it is discussed in Sections 6.5 and 6.6.

For any differentiable function $Y(x)$, define the truncation error for integrating $Y'(x)$ by

$$ T_n(Y) = Y(x_{n+1}) - \left[\sum_{j=0}^{p} a_j Y(x_{n-j}) + h \sum_{j=-1}^{p} b_j Y'(x_{n-j})\right], \qquad n \ge p \tag{6.3.4} $$

Define the function $\tau_n(Y)$ by

$$ \tau_n(Y) = \frac{1}{h}\,T_n(Y) \tag{6.3.5} $$
In order to prove the convergence of the approximate solution $\{y_n \mid x_0 \le x_n \le b\}$ of (6.3.1) to the solution $Y(x)$ of the initial value problem (6.0.1), it is necessary to have

$$ \tau(h) \equiv \max_{x_p \le x_n \le b} |\tau_n(Y)| \to 0 \quad \text{as } h \to 0 \tag{6.3.6} $$

This is often called the consistency condition for method (6.3.1). The speed of convergence of the solution $\{y_n\}$ to the true solution $Y(x)$ is related to the speed of convergence in (6.3.6), and thus we need to know the conditions under which

$$ \tau(h) = O(h^m) \tag{6.3.7} $$

for some desired choice of $m \ge 1$. We now examine the implications of (6.3.6) and (6.3.7) for the coefficients in (6.3.1). The convergence result for (6.3.1) is given later as Theorem 6.6.


Theorem 6.5  Let $m \ge 1$ be a given integer. In order that (6.3.6) hold for all continuously differentiable functions $Y(x)$, that is, that the method (6.3.1) be consistent, it is necessary and sufficient that

$$ \sum_{j=0}^{p} a_j = 1, \qquad -\sum_{j=0}^{p} j\,a_j + \sum_{j=-1}^{p} b_j = 1 \tag{6.3.8} $$

And for (6.3.7) to be valid for all functions $Y(x)$ that are $m + 1$ times continuously differentiable, it is necessary and sufficient that (6.3.8) hold and that

$$ \sum_{j=0}^{p} (-j)^i a_j + i \sum_{j=-1}^{p} (-j)^{i-1} b_j = 1, \qquad i = 2, \ldots, m \tag{6.3.9} $$


Proof  Note that

$$ T_n(\alpha Y + \beta W) = \alpha T_n(Y) + \beta T_n(W) \tag{6.3.10} $$

for all constants $\alpha$, $\beta$ and all differentiable functions $Y$, $W$. To examine the consequences of (6.3.6) and (6.3.7), expand $Y(x)$ about $x_n$, using Taylor's theorem 1.4, to obtain

$$ Y(x) = \sum_{i=0}^{m} \frac{1}{i!}\,(x - x_n)^i\,Y^{(i)}(x_n) + R_{m+1}(x) \tag{6.3.11} $$

assuming $Y(x)$ is $m + 1$ times continuously differentiable. Substituting into (6.3.4) and using (6.3.10),

$$ T_n(Y) = \sum_{i=0}^{m} \frac{1}{i!}\,Y^{(i)}(x_n)\,T_n\big((x - x_n)^i\big) + T_n(R_{m+1}) $$

It is necessary to calculate $T_n((x - x_n)^i)$ for $i \ge 0$.
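For $i = 0$,

$$ T_n(1) = c_0 \equiv 1 - \sum_{j=0}^{p} a_j \tag{6.3.12} $$

For $i \ge 1$,

$$ T_n\big((x - x_n)^i\big) = (x_{n+1} - x_n)^i - \left[\sum_{j=0}^{p} a_j (x_{n-j} - x_n)^i + h \sum_{j=-1}^{p} b_j\,i\,(x_{n-j} - x_n)^{i-1}\right] = c_i h^i $$

$$ c_i = 1 - \left[\sum_{j=0}^{p} (-j)^i a_j + i \sum_{j=-1}^{p} (-j)^{i-1} b_j\right], \qquad i \ge 1 \tag{6.3.13} $$

This gives

$$ T_n(Y) = \sum_{i=0}^{m} \frac{c_i}{i!}\,h^i\,Y^{(i)}(x_n) + T_n(R_{m+1}) \tag{6.3.14} $$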


If we write the remainder $R_{m+1}(x)$ as

$$ R_{m+1}(x) = \frac{1}{(m+1)!}\,(x - x_n)^{m+1}\,Y^{(m+1)}(x_n) + \cdots $$

then

$$ T_n(R_{m+1}) = \frac{c_{m+1}}{(m+1)!}\,h^{m+1}\,Y^{(m+1)}(x_n) + O(h^{m+2}) \tag{6.3.15} $$

assuming $Y$ is $m + 2$ times differentiable.

To obtain the consistency condition (6.3.6), we need $\tau(h) = O(h)$, and this requires $T_n(Y) = O(h^2)$. Using (6.3.14) with $m = 1$, we must have $c_0, c_1 = 0$, which gives the set of equations (6.3.8). In some texts, these equations are referred to as the consistency conditions. To obtain (6.3.7) for some $m \ge 1$, we must have $T_n(Y) = O(h^{m+1})$. From (6.3.14) and (6.3.13), this will be true if and only if $c_i = 0$, $i = 0, 1, \ldots, m$. This proves the conditions (6.3.9) and completes the proof. ■
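The coefficients $c_i$ of (6.3.12) and (6.3.13) are easy to evaluate exactly, which gives a mechanical check of the order of a given method. The sketch below uses rational arithmetic; the function names and the argument layout (`a` for $a_0,\ldots,a_p$, `b_minus1` for $b_{-1}$, `b` for $b_0,\ldots,b_p$) are illustrative choices, not notation from the text.

```python
from fractions import Fraction as Fr

def c_coeff(i, a, b_minus1, b):
    """c_i from (6.3.12)-(6.3.13); a = [a_0..a_p], b = [b_0..b_p], b_minus1 = b_{-1}."""
    if i == 0:
        return 1 - sum(a)
    a_sum = sum(Fr(-j) ** i * aj for j, aj in enumerate(a))
    b_all = [(-1, b_minus1)] + list(enumerate(b))           # pairs (j, b_j), j = -1..p
    b_sum = sum(Fr(-j) ** (i - 1) * bj for j, bj in b_all)  # Fraction(0)**0 == 1
    return 1 - (a_sum + i * b_sum)

def order(a, b_minus1, b, max_order=10):
    """Largest m with c_0 = ... = c_m = 0 (returns 0 if the method is not consistent)."""
    m = -1
    while m < max_order and c_coeff(m + 1, a, b_minus1, b) == 0:
        m += 1
    return max(m, 0)

euler_order = order([1], 0, [1])
midpoint_order = order([0, 1], 0, [2, 0])
trapezoid_order = order([1], Fr(1, 2), [Fr(1, 2)])
```

For the midpoint method $c_3 = 2$ and for the trapezoidal method $c_3 = -\tfrac{1}{2}$, so by (6.3.15) the leading truncation errors are $\tfrac{1}{3}h^3 Y^{(3)}$ and $-\tfrac{1}{12}h^3 Y^{(3)}$, in agreement with (6.4.1) and (6.5.1).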


The largest value of $m$ for which (6.3.7) holds is called the order or order of convergence of the method (6.3.1). In Section 6.7, we examine the derivation of methods of any desired order.

We now give a convergence result for the solution of (6.3.1). Although the theorem will not include all the multistep methods that are convergent, it does include most methods of current interest. Moreover, the proof is much easier than that of the more general Theorem 6.8 of Section 6.8.
Theorem 6.6  Consider solving the initial value problem

$$ y' = f(x, y), \qquad y(x_0) = Y_0, \qquad x_0 \le x \le b $$

using the multistep method (6.3.1). Let the initial errors satisfy

$$ \eta(h) \equiv \max_{0 \le i \le p} |Y(x_i) - y_h(x_i)| \to 0 \quad \text{as } h \to 0 \tag{6.3.16} $$

Assume the method is consistent, that is, it satisfies (6.3.6). And finally, assume that the coefficients $a_j$ are all nonnegative,

$$ a_j \ge 0, \qquad j = 0, 1, \ldots, p \tag{6.3.17} $$

Then the method (6.3.1) is convergent, and

$$ \max_{x_0 \le x_n \le b} |Y(x_n) - y_h(x_n)| \le c_1\,\eta(h) + c_2\,\tau(h) \tag{6.3.18} $$

for suitable constants $c_1$, $c_2$. If the method (6.3.1) is of order $m$, and if the initial errors satisfy $\eta(h) = O(h^m)$, then the speed of convergence of the method is $O(h^m)$.


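Proof  Rewrite (6.3.4) and use $Y'(x) = f(x, Y(x))$ to get

$$ Y(x_{n+1}) = \sum_{j=0}^{p} a_j Y(x_{n-j}) + h \sum_{j=-1}^{p} b_j f(x_{n-j}, Y(x_{n-j})) + h\,\tau_n(Y) $$

Subtracting (6.3.1), and using the notation $e_i = Y(x_i) - y_i$,

$$ e_{n+1} = \sum_{j=0}^{p} a_j e_{n-j} + h \sum_{j=-1}^{p} b_j\left[f(x_{n-j}, Y_{n-j}) - f(x_{n-j}, y_{n-j})\right] + h\,\tau_n(Y) $$

Apply the Lipschitz condition and the assumption (6.3.17) to obtain

$$ |e_{n+1}| \le \sum_{j=0}^{p} a_j |e_{n-j}| + hK \sum_{j=-1}^{p} |b_j|\,|e_{n-j}| + h\,\tau(h) $$

Introduce the following error bounding function,

$$ f_n = \max_{0 \le i \le n} |e_i|, \qquad n = 0, 1, \ldots, N(h) $$

Using this function,

$$ |e_{n+1}| \le \sum_{j=0}^{p} a_j f_n + hK \sum_{j=-1}^{p} |b_j|\,f_{n+1} + h\,\tau(h) $$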
and applying (6.3.8),

$$ |e_{n+1}| \le f_n + hc\,f_{n+1} + h\,\tau(h), \qquad c = K \sum_{j=-1}^{p} |b_j| $$

The right side is trivially a bound for $f_n$, and thus

$$ f_{n+1} \le f_n + hc\,f_{n+1} + h\,\tau(h) $$

For $hc \le \tfrac{1}{2}$, which must be true as $h \to 0$,

$$ f_{n+1} \le \frac{f_n}{1 - hc} + \frac{h}{1 - hc}\,\tau(h) \le (1 + 2hc)\,f_n + 2h\,\tau(h) $$

Noting that $f_p = \eta(h)$, proceed as in the proof of Theorem 6.3 following (6.2.20). Then

$$ f_n \le e^{2c(b - x_0)}\,\eta(h) + \left[\frac{e^{2c(b - x_0)} - 1}{c}\right]\tau(h), \qquad x_0 \le x_n \le b \tag{6.3.19} $$

This completes the proof. ■


The conclusions of the theorem can be proved under weaker assumptions; in particular, (6.3.17) can be replaced by a much weaker assumption. These results are given in Section 6.8. To obtain a rate of convergence of $O(h^m)$ for the method (6.3.1), it is necessary that each step has an error

$$ T_n(Y) = O(h^{m+1}) $$

But the initial values $y_0, \ldots, y_p$ need to be computed with only an accuracy of $O(h^m)$, since $\eta(h) = O(h^m)$ is sufficient in (6.3.18). Examples illustrating the use of a lower order method for generating the initial values $y_0, \ldots, y_p$ are given in the following sections.

The result (6.3.19) can be improved somewhat for particular cases, but the 
speed of convergence will remain the same. Examples of the theorem are given in 
the following sections. As with Euler’s method, a complete stability analysis can 
be given, including a result of the form (6.2.29). The proof is a straightforward 
modification of the proof of Theorem 6.6. Also, an asymptotic error analysis can 
be given; examples are given in the next two sections. 


6.4 The Midpoint Method 


We will define and analyze the midpoint method, using it to illustrate some ideas not possible with Euler's method. We can derive the midpoint method in several ways, as with Euler's method, and we use numerical differentiation here.
From (5.7.11) of Chapter 5, we have

$$ g'(a) = \frac{g(a + h) - g(a - h)}{2h} - \frac{h^2}{6}\,g^{(3)}(\xi) $$

for some $a - h \le \xi \le a + h$. Applying this to

$$ Y'(x_n) = f(x_n, Y(x_n)) $$

we have

$$ \frac{Y(x_{n+1}) - Y(x_{n-1})}{2h} - \frac{h^2}{6}\,Y^{(3)}(\xi_n) = f(x_n, Y(x_n)) $$

with $x_{n-1} \le \xi_n \le x_{n+1}$. Solving for $Y(x_{n+1})$,

$$ Y(x_{n+1}) = Y(x_{n-1}) + 2h f(x_n, Y(x_n)) + \frac{h^3}{3}\,Y^{(3)}(\xi_n) \tag{6.4.1} $$

The midpoint method is obtained by dropping the last term:

$$ y_{n+1} = y_{n-1} + 2h f(x_n, y_n), \qquad n \ge 1 \tag{6.4.2} $$

It is an explicit two-step method, and the order of convergence is two. The value of $y_1$ must be obtained by another method.

The midpoint method could also have been obtained by applying the midpoint numerical integration rule (5.2.17) to the following integral reformulation of the differential equation (6.0.1):

$$ Y(x_{n+1}) = Y(x_{n-1}) + \int_{x_{n-1}}^{x_{n+1}} f(t, Y(t))\,dt \tag{6.4.3} $$

We omit the details. This approach is used in Section 6.7, to obtain other multistep methods.

To analyze the convergence of (6.4.2), we use Theorem 6.6. From (6.4.1), (6.3.4), (6.3.5), we easily obtain that

$$ \tau_n(Y) = \tfrac{1}{3}\,h^2\,Y^{(3)}(\xi_n), \qquad x_{n-1} \le \xi_n \le x_{n+1} \tag{6.4.4} $$


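An improved version of the proof of Theorem 6.6, for the midpoint method, yields

$$ \max_{x_0 \le x_n \le b} |Y(x_n) - y_h(x_n)| \le e^{2K(b - x_0)}\,\eta(h) + \left[\frac{e^{2K(b - x_0)} - 1}{2K}\right]\left[\tfrac{1}{3}\,h^2\,\|Y^{(3)}\|_\infty\right] \tag{6.4.5} $$

$$ \eta(h) = \max\{\,|Y_0 - y_0|,\ |Y(x_1) - y_h(x_1)|\,\} $$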
Assuming $y_0 = Y_0$ for all $h$, we need to have $Y(x_1) - y_h(x_1) = O(h^2)$ in order to have a global order of convergence $O(h^2)$ in (6.4.5). From (6.2.4), a single step of Euler's method has the desired property:

$$ y_1 = y_0 + h f(x_0, y_0), \qquad y_0 = Y_0 \tag{6.4.6} $$

$$ Y(x_1) - y_1 = \frac{h^2}{2}\,Y''(\xi_1), \qquad x_0 \le \xi_1 \le x_1 \tag{6.4.7} $$

With this initial value $y_h(x_1) = y_1$, the error result (6.4.5) implies

$$ \max_{x_0 \le x_n \le b} |Y(x_n) - y_h(x_n)| = O(h^2) \tag{6.4.8} $$

A complete stability analysis can be given for the midpoint method, paralleling that given for Euler's method. If we assume for simplicity that $\eta(h) = O(h^3)$, then we have the following asymptotic error formula:

$$ Y(x_n) - y_h(x_n) = D(x_n)\,h^2 + O(h^3), \qquad x_0 \le x_n \le b $$

$$ D'(x) = f_y(x, Y(x))\,D(x) + \tfrac{1}{6}\,Y^{(3)}(x), \qquad D(x_0) = 0 \tag{6.4.9} $$
There is little that is different in the proofs of these results, and we dispense with 
them. 
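The second-order convergence in (6.4.8) can be observed by halving the stepsize. The following sketch starts the midpoint method with one Euler step, as in (6.4.6); the test problem $y' = y$, $y(0) = 1$ on $[0, 1]$ is an assumption made here for illustration and is not an example from the text.

```python
import math

def midpoint(f, x0, y0, h, n_steps):
    """Midpoint method (6.4.2); y_1 comes from one Euler step, as in (6.4.6)."""
    ys = [y0, y0 + h * f(x0, y0)]
    for n in range(1, n_steps):
        ys.append(ys[n - 1] + 2.0 * h * f(x0 + n * h, ys[n]))
    return ys

f = lambda x, y: y                      # assumed test problem y' = y, y(0) = 1
# Error at x = 1 for h = 1/100 and h = 1/200
err = {n: abs(math.e - midpoint(f, 0.0, 1.0, 1.0 / n, n)[-1]) for n in (100, 200)}
ratio = err[100] / err[200]
```

Halving $h$ reduces the error by a factor close to 4, as expected for an $O(h^2)$ method, even though the starting value $y_1$ has only an $O(h^2)$ error.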


Weak stability  As previously noted, the midpoint method possesses the same type of stability as that shown for Euler's method in (6.2.28) and (6.2.29). This is, however, not sufficient for practical purposes. We show that the midpoint method is unsatisfactory with respect to another sense of stability yet to be defined.

We consider the numerical solution of the problem

$$ y' = \lambda y, \qquad y(0) = 1 \tag{6.4.10} $$

which has the solution $Y(x) = e^{\lambda x}$. This is used as a model problem for the more general problem (6.0.1), an idea that we explain in Section 6.8. At this point, it is sufficient to note that if a numerical method performs badly for a problem as simple as (6.4.10), then such a method is unlikely to perform well for other more complicated differential equations.

The midpoint method for (6.4.10) is

$$ y_{n+1} = y_{n-1} + 2h\lambda\,y_n, \qquad n \ge 1 \tag{6.4.11} $$


We calculate the exact solution of this equation and compare it to the solution $Y(x) = e^{\lambda x}$. The equation (6.4.11) is an example of a linear difference equation of order 2. There is a general theory for $p$th-order linear difference equations, paralleling the theory for $p$th-order linear differential equations. Most methods for solving the differential equations have an analogue in solving difference equations, and this is a guide in solving (6.4.11). We begin by looking for linearly independent solutions of the difference equation. These are then combined to form the general solution. For a general theory of linear difference equations, see Henrici (1962, pp. 210-215).

In analogy with the exponential solutions of linear differential equations, we assume a solution for (6.4.11) of the form

$$ y_n = r^n, \qquad n \ge 0 \tag{6.4.12} $$

for some unknown $r$. Substituting into (6.4.11) to find necessary conditions on $r$,

$$ r^{n+1} = r^{n-1} + 2h\lambda\,r^n $$

Canceling $r^{n-1}$,

$$ r^2 = 1 + 2h\lambda\,r \tag{6.4.13} $$

This argument is reversible. If $r$ satisfies the quadratic equation (6.4.13), then (6.4.12) satisfies (6.4.11).
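The equation (6.4.13) is called the characteristic equation for the midpoint method. Its roots are

$$ r_0 = h\lambda + \sqrt{1 + h^2\lambda^2}, \qquad r_1 = h\lambda - \sqrt{1 + h^2\lambda^2} \tag{6.4.14} $$

The general solution to (6.4.11) is then

$$ y_n = \beta_0 r_0^n + \beta_1 r_1^n, \qquad n \ge 0 \tag{6.4.15} $$

The coefficients $\beta_0$ and $\beta_1$ are chosen so that the values of $y_0$ and $y_1$ that were given originally will agree with the values calculated using (6.4.15):

$$ \beta_0 + \beta_1 = y_0, \qquad \beta_0 r_0 + \beta_1 r_1 = y_1 $$

Solving for the coefficients,

$$ \beta_0 = \frac{y_1 - r_1 y_0}{r_0 - r_1}, \qquad \beta_1 = \frac{y_0 r_0 - y_1}{r_0 - r_1} $$

To gain some intuition for these formulas, consider taking the exact initial values

$$ y_0 = 1, \qquad y_1 = e^{\lambda h} $$

Then using Taylor's theorem,

$$ \beta_0 = \frac{e^{\lambda h} - r_1}{2\sqrt{1 + h^2\lambda^2}} = 1 + O(h^2\lambda^2), \qquad \beta_1 = \frac{r_0 - e^{\lambda h}}{2\sqrt{1 + h^2\lambda^2}} = O(h^3\lambda^3) $$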
From these values, $\beta_0 \to 1$ and $\beta_1 \to 0$ as $h \to 0$. Therefore in the formula (6.4.15), the term $\beta_0 r_0^n$ should correspond to the true solution $e^{\lambda x_n}$, since the term $\beta_1 r_1^n \to 0$ as $h \to 0$. In fact,

$$ r_0^n = e^{\lambda x_n}\,[1 + O(h^2)] \tag{6.4.17} $$

whose proof we leave to the reader.

To see the difficulty in the numerical solution of $y' = \lambda y$ using (6.4.15), examine carefully the relative sizes of $r_0$ and $r_1$. We consider only the case of real $\lambda$. For $0 < \lambda < \infty$, for all $h$,

$$ r_0 > |r_1| > 0 $$

Thus the term $r_1^n$ will increase less rapidly than $r_0^n$, and the correct term in the general solution (6.4.15) will dominate, namely $\beta_0 r_0^n$.

However, for $-\infty < \lambda < 0$, we will have

$$ 0 < r_0 < 1, \qquad r_1 < -1, \qquad h > 0 $$

As a consequence, $\beta_1 r_1^n$ will eventually dominate $\beta_0 r_0^n$ as $n$ increases, for fixed $h$, no matter how small $h$ is chosen initially. The term $\beta_0 r_0^n \to 0$ as $n \to \infty$, whereas the term $\beta_1 r_1^n$ increases in magnitude, alternating in sign as $n$ increases.

The term $\beta_1 r_1^n$ is called a parasitic solution of the numerical method (6.4.11), since it does not correspond to any solution of the original differential equation $y' = \lambda y$. This original equation has a one-parameter family of solutions, depending on the initial value $Y_0$, but the approximation (6.4.11) has the two-parameter family (6.4.15), which is dependent on $y_0$ and $y_1$. The new solution $\beta_1 r_1^n$ is a creation of the numerical method; for problem (6.4.10) with $\lambda < 0$, it will cause the numerical solution to diverge from the true solution as $x_n \to \infty$. Because of this behavior, we say that the midpoint method is only weakly stable.
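The split of the numerical solution into a principal term and a parasitic term can be computed directly from the characteristic roots. The sketch below takes $\lambda = -1$, $h = .25$, and an Euler first step; the function names are illustrative.

```python
import math

def midpoint_model(lam, h, n_steps):
    """Midpoint method (6.4.11) for y' = lam*y, y(0) = 1, with an Euler first step."""
    ys = [1.0, 1.0 + h * lam]
    for n in range(1, n_steps):
        ys.append(ys[n - 1] + 2.0 * h * lam * ys[n])
    return ys

def closed_form(lam, h, n, y0, y1):
    """Terms of the general solution (6.4.15), using the roots (6.4.14) and
    coefficients fitted to the starting values y0, y1."""
    s = math.sqrt(1.0 + (h * lam) ** 2)
    r0, r1 = h * lam + s, h * lam - s
    beta0 = (y1 - r1 * y0) / (r0 - r1)
    beta1 = (y0 * r0 - y1) / (r0 - r1)
    return beta0 * r0 ** n, beta1 * r1 ** n     # (principal term, parasitic term)

ys = midpoint_model(-1.0, 0.25, 9)              # y_0 .. y_9, i.e. up to x = 2.25
principal, parasitic = closed_form(-1.0, 0.25, 9, ys[0], ys[1])
```

At $x_n = 2.25$ the parasitic term is already larger in magnitude than the principal term, which is why the computed value $y_9$ is negative even though the true solution $e^{-x}$ is positive.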

We return to this topic in Section 6.8, after some necessary theory has been introduced. We generalize the applicability of the model problem (6.4.10) by considering the sign of $\partial f(x, Y(x))/\partial y$. If it is negative, then the weak instability of the midpoint method will usually appear in solving the associated initial value problem. This is illustrated in the second example below.


Example  1. Consider the model problem (6.4.10) with $\lambda = -1$. The numerical results are given in Table 6.6 for $h = .25$. The value $y_1$ was obtained using Euler's method, as in (6.4.6). From the values in the table, the parasitic solution is clearly growing in magnitude. For $x_n = 2.25$, the numerical solution $y_n$ becomes negative, and it alternates in sign with each successive step.

2. Consider the problem

$$ y' = x - y^2, \qquad y(0) = 0 $$

The solution $Y(x)$ is strictly increasing for $x \ge 0$; for large $x$, $Y(x) \doteq \sqrt{x}$.
Although it is increasing,

$$ \frac{\partial f(x, y)}{\partial y} = -2y < 0 \qquad \text{for } y > 0 $$

Therefore, we would expect that the midpoint method would exhibit some instability. This is confirmed with the results given in Table 6.7, obtained with a stepsize of $h = .25$. By $x_n = 2.25$, the numerical solution begins to oscillate, and at $x_n = 3.25$, the solution $y_n$ becomes negative.

Table 6.6  Example 1 of midpoint method instability

  x_n      y_n       Y(x_n)     Error
  .25      .7500     .7788      .0288
  .50      .6250     .6065     -.0185
  .75      .4375     .4724      .0349
 1.00      .4063     .3679     -.0384
 1.25      .2344     .2865      .0521
 1.50      .2891     .2231     -.0659
 1.75      .0898     .1738      .0839
 2.00      .2441     .1353     -.1088
 2.25     -.0322     .1054      .1376

Table 6.7  Example 2 of midpoint method instability

  x_n      y_n       Y(x_n)     Error
  .25     0.0        .0312      .0312
  .50      .1250     .1235     -.0015
  .75      .2422     .2700      .0278
 1.00      .4707     .4555     -.0151
 1.25      .6314     .6585      .0271
 1.50      .8963     .8574     -.0389
 1.75      .9797    1.0376      .0579
 2.00     1.2914    1.1936     -.0978
 2.25     1.1459    1.3264      .181
 2.50     1.7599    1.4405     -.319
 2.75      .8472    1.5404      .693
 3.00     2.7760    1.6302    -1.15
 3.25    -1.5058    1.7125     3.22


6.5 The Trapezoidal Method 


We use the trapezoidal method to introduce implicit methods and ideas associated with them. In addition, the trapezoidal method is of independent interest because of a special stability property it possesses. To introduce the trapezoidal rule, we use numerical integration.
Integrate the differential equation $Y'(t) = f(t, Y(t))$ over $[x_n, x_{n+1}]$ to obtain

$$ Y(x_{n+1}) = Y(x_n) + \int_{x_n}^{x_{n+1}} f(t, Y(t))\,dt $$

Apply the simple trapezoidal rule, (5.1.2) and (5.1.4), to obtain

$$ Y(x_{n+1}) = Y(x_n) + \frac{h}{2}\left[f(x_n, Y(x_n)) + f(x_{n+1}, Y(x_{n+1}))\right] - \frac{h^3}{12}\,Y^{(3)}(\xi_n) \tag{6.5.1} $$

for some $x_n \le \xi_n \le x_{n+1}$. By dropping the remainder term, we obtain the trapezoidal method,

$$ y_{n+1} = y_n + \frac{h}{2}\left[f(x_n, y_n) + f(x_{n+1}, y_{n+1})\right], \qquad n \ge 0 \tag{6.5.2} $$

It is a one-step method with an $O(h^2)$ order of convergence. It is also a simple example of an implicit method, since $y_{n+1}$ occurs on both sides of (6.5.2). A numerical example is given at the end of this section.


Iterative solution  The formula (6.5.2) is a nonlinear equation with root $y_{n+1}$, and any of the general techniques of Chapter 2 can be used to solve it. Simple linear iteration (see Section 2.5) is most convenient and it is usually sufficient. Let $y_{n+1}^{(0)}$ be a good initial guess of the solution $y_{n+1}$, and define

$$ y_{n+1}^{(j+1)} = y_n + \frac{h}{2}\left[f(x_n, y_n) + f(x_{n+1}, y_{n+1}^{(j)})\right], \qquad j = 0, 1, 2, \ldots \tag{6.5.3} $$

The initial guess is usually obtained using an explicit method.

To analyze the iteration and to determine conditions under which it will converge, subtract (6.5.3) from (6.5.2) to obtain

$$ y_{n+1} - y_{n+1}^{(j+1)} = \frac{h}{2}\left[f(x_{n+1}, y_{n+1}) - f(x_{n+1}, y_{n+1}^{(j)})\right] \tag{6.5.4} $$

Use the Lipschitz condition (6.2.12) to bound this with

$$ |y_{n+1} - y_{n+1}^{(j+1)}| \le \frac{hK}{2}\,|y_{n+1} - y_{n+1}^{(j)}|, \qquad j \ge 0 \tag{6.5.5} $$
If

$$ \frac{hK}{2} < 1 \tag{6.5.6} $$

then the iterates $y_{n+1}^{(j)}$ will converge to $y_{n+1}$ as $j \to \infty$. A more precise estimate of the convergence rate is obtained from applying the mean value theorem to (6.5.4):

$$ y_{n+1} - y_{n+1}^{(j+1)} \doteq \frac{h}{2}\,f_y(x_{n+1}, y_{n+1})\left[y_{n+1} - y_{n+1}^{(j)}\right] \tag{6.5.7} $$


Often in practice, the stepsize $h$ and the initial guess $y_{n+1}^{(0)}$ are chosen to ensure that only one iterate need be computed, and then we take $y_{n+1} \doteq y_{n+1}^{(1)}$.

The computation of $y_{n+1}$ from $y_n$ contains a truncation error that is $O(h^3)$ [see (6.5.1)]. To maintain this order of accuracy, the eventual iterate $y_{n+1}^{(j)}$, which is chosen to represent $y_{n+1}$, should satisfy $|y_{n+1} - y_{n+1}^{(j)}| = O(h^3)$. And if we want the iteration error to be less significant (as we do in the next section), then $y_{n+1}^{(j)}$ should be chosen to satisfy

$$ |y_{n+1} - y_{n+1}^{(j)}| = O(h^4) \tag{6.5.8} $$


To analyze the error in choosing an initial guess $y_{n+1}^{(0)}$, we must introduce the concept of local solution. This will also be important in clarifying exactly what solution is being obtained by most automatic computer programs for solving ordinary differential equations. Let $u_n(x)$ denote the solution of $y' = f(x, y)$ that passes through $(x_n, y_n)$:

$$ u_n'(x) = f(x, u_n(x)), \qquad u_n(x_n) = y_n \tag{6.5.9} $$

At step $x_n$, knowing $y_n$, it is $u_n(x_{n+1})$ that we are trying to calculate, rather than $Y(x_{n+1})$.

Applying the derivation that led to (6.5.1), we have

$$ u_n(x_{n+1}) = y_n + \frac{h}{2}\left[f(x_n, y_n) + f(x_{n+1}, u_n(x_{n+1}))\right] - \frac{h^3}{12}\,u_n^{(3)}(\xi_n) \tag{6.5.10} $$

for some $x_n \le \xi_n \le x_{n+1}$. Let $\epsilon_{n+1} = u_n(x_{n+1}) - y_{n+1}$, which we call the local error in computing $y_{n+1}$ from $y_n$. Subtract (6.5.2) from the preceding to obtain

$$ \begin{aligned} \epsilon_{n+1} &= \frac{h}{2}\left[f(x_{n+1}, u_n(x_{n+1})) - f(x_{n+1}, y_{n+1})\right] - \frac{h^3}{12}\,u_n^{(3)}(\xi_n) \\ &= \frac{h}{2}\,f_y(x_{n+1}, y_{n+1})\,\epsilon_{n+1} + O(h\,\epsilon_{n+1}^2) - \frac{h^3}{12}\,u_n^{(3)}(x_n) + O(h^4) \end{aligned} $$

where we have twice applied the mean value theorem. It can be shown that for all sufficiently small $h$,

$$ \epsilon_{n+1} = O(h^3) $$
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More precisely, 
3 


h 71 h 
fsa [P= F5engead] | EaPe) + 0005] 


3 : 
Un Xn+1) Int. = pun (xn) + O(h*) (6.5.11) 


This shows that the local error is essentially the truncation error. 
If Euler's method is used to compute $y_{n+1}^{(0)}$,

$$y_{n+1}^{(0)} = y_n + h f(x_n, y_n) \tag{6.5.12}$$

then $u_n(x_{n+1})$ can be expanded to show that

$$u_n(x_{n+1}) - y_{n+1}^{(0)} = \frac{h^2}{2} u_n''(\zeta_n), \qquad x_n \le \zeta_n \le x_{n+1} \tag{6.5.13}$$

Combined with (6.5.11),

$$y_{n+1} - y_{n+1}^{(0)} = O(h^2) \tag{6.5.14}$$

To satisfy (6.5.8), the bound (6.5.5) implies two iterates will have to be computed,
and then we use $y_{n+1}^{(2)}$ to represent $y_{n+1}$.


Using the midpoint method, we can obtain the more accurate initial guess

$$y_{n+1}^{(0)} = y_{n-1} + 2h f(x_n, y_n) \tag{6.5.15}$$

To estimate the error, begin by using the derivation that leads to (6.4.1) to obtain

$$u_n(x_{n+1}) = u_n(x_{n-1}) + 2h f(x_n, u_n(x_n)) + \frac{h^3}{3} u_n^{(3)}(\eta_n)$$

for some $x_{n-1} \le \eta_n \le x_{n+1}$. Subtracting (6.5.15), and using $u_n(x_n) = y_n$,

$$u_n(x_{n+1}) - y_{n+1}^{(0)} = u_n(x_{n-1}) - y_{n-1} + \frac{h^3}{3} u_n^{(3)}(\eta_n)$$

The quantity $u_n(x_{n-1}) - y_{n-1}$ can be computed in a manner similar to that used
for (6.5.11) with about the same result:

$$u_n(x_{n-1}) - y_{n-1} = \frac{h^3}{12} u_n^{(3)}(x_n) + O(h^4) \tag{6.5.16}$$

Then

$$u_n(x_{n+1}) - y_{n+1}^{(0)} = \frac{5h^3}{12} u_n^{(3)}(x_n) + O(h^4)$$

And combining this with (6.5.11),

$$y_{n+1} - y_{n+1}^{(0)} = \frac{h^3}{2} u_n^{(3)}(x_n) + O(h^4) \tag{6.5.17}$$

With the initial guess (6.5.15), one iterate from (6.5.3) will be sufficient to satisfy
(6.5.8), based on the bound in (6.5.5).

The formulas (6.5.12) and (6.5.15) are called predictor formulas, and the
trapezoidal iteration formula (6.5.3) is called a corrector formula. Together they
form a predictor-corrector method, and they are the basis of a method that can
be used to control the size of the local error. This is illustrated in the next section.
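The orders in (6.5.14) and (6.5.17) can be observed numerically by halving $h$
and watching how the gap between the predictor and the converged corrector
shrinks. The sketch below is only an illustration: the test equation
$y' = -y^2$, the node $x_n = 1$, the two stepsizes, and the helper names are
assumptions, and $y_{n-1}$ is simply placed on the true solution $1/(1+x)$.

```python
def f(x, y):
    return -y * y                      # assumed test equation, Y(x) = 1/(1+x)


def trapezoid_step(x, y, h, iterations=50):
    """Solve (6.5.2) for y_{n+1} by iterating (6.5.3) until it has settled."""
    fx = f(x, y)
    y_next = y + h * fx
    for _ in range(iterations):
        y_next = y + 0.5 * h * (fx + f(x + h, y_next))
    return y_next


def predictor_gaps(h, x_n=1.0):
    """Return y_{n+1} - y^(0)_{n+1} for the Euler and midpoint predictors."""
    x_prev = x_n - h
    y_prev = 1.0 / (1.0 + x_prev)      # assumption: start on the true solution
    y_n = trapezoid_step(x_prev, y_prev, h)
    y_next = trapezoid_step(x_n, y_n, h)
    euler = y_n + h * f(x_n, y_n)                 # predictor (6.5.12)
    midpoint = y_prev + 2.0 * h * f(x_n, y_n)     # predictor (6.5.15)
    return y_next - euler, y_next - midpoint


coarse, fine = predictor_gaps(0.02), predictor_gaps(0.01)
euler_ratio = coarse[0] / fine[0]       # near 2**2 when the gap is O(h^2)
midpoint_ratio = coarse[1] / fine[1]    # near 2**3 when the gap is O(h^3)
print(euler_ratio, midpoint_ratio)
```

A ratio near 4 for the Euler predictor and near 8 for the midpoint predictor is
what the two estimates lead one to expect.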


Convergence and stability results  The convergence of the trapezoidal method is
assured by Theorem 6.6. Assuming $hK \le 1$,

$$\max_{x_0 \le x_n \le b} |Y(x_n) - y_h(x_n)| \le e^{2K(b - x_0)}|e_0| + \left[\frac{e^{2K(b - x_0)} - 1}{K}\right]\left[\frac{h^2}{12}\,\|Y^{(3)}\|_\infty\right] \tag{6.5.18}$$

The derivation of an asymptotic error formula is similar to that for Euler's
method. Assuming $e_0 = \delta_0 h^2 + O(h^3)$, we can show

$$Y(x_n) - y_h(x_n) = D(x_n)h^2 + O(h^3) \tag{6.5.19}$$

$$D'(x) = f_y(x, Y(x))\,D(x) - \frac{1}{12} Y^{(3)}(x), \qquad D(x_0) = \delta_0$$


The standard type of stability result, such as that given in (6.2.28) and (6.2.29) for 
Euler’s method, can also be given for the trapezoidal method. We leave the proof 
of this to the reader. 

As with the midpoint method, we can examine the effect of applying the
trapezoidal rule to the model equation

$$y' = \lambda y, \qquad y(0) = 1 \tag{6.5.20}$$

whose solution is $Y(x) = e^{\lambda x}$. To give further motivation for doing so, consider
the trapezoidal method applied to the linear equation

$$y' = \lambda y + g(x), \qquad y(0) = Y_0 \tag{6.5.21}$$

namely

$$y_{n+1} = y_n + \frac{h}{2}\left[\lambda y_n + g(x_n) + \lambda y_{n+1} + g(x_{n+1})\right], \qquad n \ge 0 \tag{6.5.22}$$

with $y_0 = Y_0$. Then consider the perturbed numerical method

$$z_{n+1} = z_n + \frac{h}{2}\left[\lambda z_n + g(x_n) + \lambda z_{n+1} + g(x_{n+1})\right], \qquad n \ge 0$$

with $z_0 = Y_0 + \epsilon$. To analyze the effect of the perturbation in the initial value, let
$w_n = z_n - y_n$. Subtracting,

$$w_{n+1} = w_n + \frac{h}{2}\left[\lambda w_n + \lambda w_{n+1}\right], \qquad n \ge 0, \qquad w_0 = \epsilon \tag{6.5.23}$$


This is simply the trapezoidal method applied to our model problem, except that
the initial value is $\epsilon$ rather than 1. The numerical solution in (6.5.23) is simply $\epsilon$
times that obtained in the numerical solution of (6.5.20). Thus the behavior of the
numerical solution of the model problem (6.5.20) will give us the stability
behavior of the trapezoidal rule applied to (6.5.21).

The model problems in which we are interested are those for which $\lambda$ is real
and negative or $\lambda$ is complex with negative real part. The reason for this choice is
that then the differential equation problem (6.5.21) is well-conditioned, as noted
in (6.1.8), and the major interesting cases excluded are $\lambda = 0$ and $\lambda$ strictly
imaginary.

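Applying the trapezoidal rule to (6.5.20),

$$y_{n+1} = y_n + \frac{h\lambda}{2}\left[y_n + y_{n+1}\right], \qquad y_0 = 1$$

Then

$$y_{n+1} = \left[\frac{1 + (h\lambda/2)}{1 - (h\lambda/2)}\right] y_n, \qquad n \ge 0$$

Inductively

$$y_n = \left[\frac{1 + (h\lambda/2)}{1 - (h\lambda/2)}\right]^n, \qquad n \ge 0 \tag{6.5.24}$$

provided $h\lambda \ne 2$. For the case of real $\lambda < 0$, write

$$r = \frac{1 + (h\lambda/2)}{1 - (h\lambda/2)} = 1 + \frac{h\lambda}{1 - (h\lambda/2)} = -1 + \frac{2}{1 - (h\lambda/2)}$$

This shows $-1 < r < 1$ for all values of $h > 0$. Thus

$$\lim_{n \to \infty} y_n = 0 \tag{6.5.25}$$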
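The size of the growth factor in (6.5.24) is easy to inspect numerically. The
sketch below evaluates it at a few sample values of $z = h\lambda$ with negative
real part, and contrasts it with the factor $1 + z$ of Euler's method. The
sample points and function names are illustrative assumptions.

```python
def trapezoid_factor(z):
    """Growth factor r in y_{n+1} = r * y_n for y' = lambda*y, with z = h*lambda."""
    return (1 + z / 2) / (1 - z / 2)


def euler_factor(z):
    """Growth factor of Euler's method for the same model problem."""
    return 1 + z


# Assumed sample values of h*lambda, all with negative real part.
samples = [complex(-0.1, 0), complex(-10, 0), complex(-1000, 0),
           complex(-1, 50), complex(-1e-3, 1e3)]
trap_ok = all(abs(trapezoid_factor(z)) < 1 for z in samples)
euler_ok = all(abs(euler_factor(z)) < 1 for z in samples)
print(trap_ok, euler_ok)
```

For the trapezoidal factor every sample lies inside the unit disk, however
large $|h\lambda|$ is, whereas the Euler factor exceeds 1 in magnitude as soon
as $h\lambda$ is far enough from the origin.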


There are no limitations on $h$ in order to have boundedness of $\{y_n\}$, and thus
stability of the numerical method in (6.5.22) is assured for all $h > 0$ and all
$\lambda < 0$. This is a stronger statement than is possible with most numerical methods,
where generally $h$ must be sufficiently small to ensure stability. For certain
applications, such as stiff differential equations, this is an important consideration. The
property that (6.5.25) holds for all $h > 0$ and all complex $\lambda$ with Real$(\lambda) < 0$ is
called A-stability. We explore it further in Section 6.8 and Problem 37.

Richardson error estimation  This error estimation was introduced in Section
5.4, and it was used both to predict the error [as in (5.4.42)] and to obtain a more
rapidly convergent numerical integration method [as in (5.4.40)]. It can also be
used in both of these ways in solving differential equations, although we will use
it mainly to predict the error.

Let $y_h(x)$ and $y_{2h}(x)$ denote the numerical solutions to $y' = f(x, y)$ on
$[x_0, b]$, obtained using the trapezoidal method (6.5.2) with stepsizes $h$ and $2h$,
respectively. Then using (6.5.19),

$$Y(x_n) - y_h(x_n) = D(x_n)h^2 + O(h^3)$$
$$Y(x_n) - y_{2h}(x_n) = 4D(x_n)h^2 + O(h^3)$$

Multiply the first equation by four, subtract the second, and solve for $Y(x_n)$:

$$Y(x_n) = \frac{1}{3}\left[4y_h(x_n) - y_{2h}(x_n)\right] + O(h^3) \tag{6.5.26}$$

The formula on the right side has a higher order of convergence than the
trapezoidal method, but note that it requires the computation of $y_h(x_n)$ and
$y_{2h}(x_n)$ for all nodes $x_n$ in $[x_0, b]$.

The formula (6.5.26) is of greater use in predicting the global error in $y_h(x)$.
Using (6.5.26),

$$Y(x_n) - y_h(x_n) = \frac{1}{3}\left[y_h(x_n) - y_{2h}(x_n)\right] + O(h^3)$$

The left side is $O(h^2)$, from (6.5.19), and thus the first term on the right side must
also be $O(h^2)$. Thus

$$Y(x_n) - y_h(x_n) \approx \frac{1}{3}\left[y_h(x_n) - y_{2h}(x_n)\right] \tag{6.5.27}$$

is an asymptotic estimate of the error. This is a practical procedure for estimating
the global error, although the way we have derived it does not allow for a variable
stepsize in the nodes.


Example  Consider the problem

$$y' = -y^2, \qquad y(0) = 1$$

which has the solution $Y(x) = 1/(1 + x)$. The results in Table 6.8 are for
stepsizes $h = .25$ and $2h = .5$. The last column is the error estimate (6.5.27), and
it is an accurate estimator of the true error $Y(x) - y_h(x)$.

Table 6.8  Trapezoidal method and Richardson error estimation

 x     y_{2h}(x)   Y(x) - y_{2h}(x)   y_h(x)     Y(x) - y_h(x)   (1/3)[y_h(x) - y_{2h}(x)]
1.0    .483144     .016856            .496021    .003979         .004292
2.0    .323610     .009723            .330991    .002342         .002460
3.0    .243890     .006110            .248521    .001479         .001543
4.0    .195838     .004162            .198991    .001009         .001051
5.0    .163658     .003008            .165937    .000730         .000759
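The first row of Table 6.8 can be reproduced with a few lines of code. This is a
minimal sketch, assuming the implicit equation (6.5.2) is solved by running the
iteration (6.5.3) until it has settled; the function names and the fixed
iteration count are illustrative choices.

```python
def f(x, y):
    return -y * y                      # the example problem, Y(x) = 1/(1+x)


def trapezoid_solve(h, x_end, x0=0.0, y0=1.0, iterations=200):
    """Trapezoidal method (6.5.2) with fixed stepsize h; returns y_h(x_end)."""
    steps = round((x_end - x0) / h)
    x, y = x0, y0
    for _ in range(steps):
        fx = f(x, y)
        y_next = y + h * fx                          # Euler starting guess
        for _ in range(iterations):                  # iterate (6.5.3) to convergence
            y_next = y + 0.5 * h * (fx + f(x + h, y_next))
        x, y = x + h, y_next
    return y


y_h = trapezoid_solve(0.25, 1.0)
y_2h = trapezoid_solve(0.5, 1.0)
estimate = (y_h - y_2h) / 3.0          # Richardson estimate (6.5.27)
true_error = 1.0 / (1.0 + 1.0) - y_h   # Y(1) - y_h(1)
print(y_2h, y_h, true_error, estimate)
```

The estimate slightly overstates the true error at this stepsize, in line with
the last two columns of the table.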


6.6 A Low-Order Predictor-Corrector Algorithm


In this section, a fairly simple algorithm is described for solving the initial value
problem (6.0.1). It uses the trapezoidal method (6.5.2), and it controls the size of
the local error by varying the stepsize $h$. The method is not practical because of
its low order of convergence, but it demonstrates some of the ideas and
techniques involved in constructing a variable-stepsize predictor-corrector
algorithm. It is also simpler to understand than algorithms based on higher order
methods.

Each step from $x_n$ to $x_{n+1}$ will consist of constructing $y_{n+1}$ from $y_n$ and $y_{n-1}$,
and $y_{n+1}$ will be an approximate solution of (6.5.2) based on using some iterate
from (6.5.3). A regular step has $x_{n+1} - x_n = x_n - x_{n-1} = h$, the midpoint
predictor (6.5.15) is used, and the local error is predicted using the difference of
the predictor and corrector formulas. When the stepsize is being changed, the
Euler predictor (6.5.12) is used.

The user of the algorithm will have to specify several parameters in addition to
those defining the differential equation problem (6.0.1). The stepsize will vary,
and the user must specify values $h_{\min}$ and $h_{\max}$ that limit the size of $h$. The user
should also specify an initial value for $h$; and the value should be one for which
$\frac{1}{2} h f_y(x_0, y_0)$ is sufficiently less than 1 in magnitude, say less than 0.1. This
quantity will determine the speed of convergence of the iteration in (6.5.3) and is
discussed later in the section, following the numerical example. An error
tolerance $\epsilon$ must be given, and the stepsize $h$ is so chosen that the local error trunc
satisfies

$$\tfrac{1}{4}\epsilon h \le |\text{trunc}| \le \epsilon h \tag{6.6.1}$$

at each step. This is called controlling the error per unit stepsize. Its significance is
discussed near the end of the section.

The notation of the preceding section is continued. The function $u_n(x)$ is the
solution of $y' = f(x, y)$ that passes through $(x_n, y_n)$. The local error to be
estimated and controlled is (6.5.11), which is the error in obtaining $u_n(x_{n+1})$
using the trapezoidal method:

$$u_n(x_{n+1}) - y_{n+1} = -\frac{h^3}{12} u_n^{(3)}(x_n) + O(h^4), \qquad h = x_{n+1} - x_n \tag{6.6.2}$$

If $y_n$ is sufficiently close to $Y(x_n)$, then this is a good approximation to the
closely related truncation error in (6.5.1):

$$-\frac{h^3}{12} Y^{(3)}(\xi_n)$$

And (6.6.2) is the only quantity for which we have the information needed to
control it.


Choosing the initial stepsize  The problem is to find an initial value of $h$ and
node $x_1 = x_0 + h$, for which $|y_1 - Y(x_1)|$ satisfies the bounds in (6.6.1). With
the initial $h$ supplied by the user, the value $y_h(x_1)$ is obtained by using the Euler
predictor (6.5.12) and iterating twice in (6.5.3). Using the same procedure, the
values $y_{h/2}(x_0 + h/2)$ and $y_{h/2}(x_1)$ are also calculated. The Richardson
extrapolation procedure is used to predict the error in $y_h(x_1)$,

$$Y(x_1) - y_h(x_1) \approx \frac{4}{3}\left[y_{h/2}(x_1) - y_h(x_1)\right] \tag{6.6.3}$$

If this error satisfies the bounds of (6.6.1), then the value of $h$ is accepted, and
the regular trapezoidal step using the midpoint predictor (6.5.15) is begun. But if
(6.6.1) is not satisfied by (6.6.3), then a new value of $h$ is chosen.

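Using the values

$$f_0 = f(x_0, y_0), \qquad f_1 = f\!\left(x_0 + \frac{h}{2},\; y_{h/2}\!\left(x_0 + \frac{h}{2}\right)\right), \qquad f_2 = f(x_1, y_{h/2}(x_1))$$

obtain the approximation

$$Y^{(3)}(x_0) \approx D_3 y \equiv \frac{f_2 - 2f_1 + f_0}{h^2/4} \tag{6.6.4}$$

This is an approximation using the second-order divided difference of $Y' =
f(x, Y)$; for example, apply Lemma 2 of Section 3.4. For any small stepsize $h$,
the truncation error at $x_0 + h$ is well approximated by

$$-\frac{h^3}{12} Y^{(3)}(x_0) \approx -\frac{h^3}{12} D_3 y$$

The new stepsize $h$ is chosen so that

$$\frac{h^3}{12}|D_3 y| = \frac{1}{2}\epsilon h, \qquad h = \sqrt{\frac{6\epsilon}{|D_3 y|}} \tag{6.6.5}$$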
; 


This should place the initial truncation error in approximately the middle of the
range (6.6.1) for the error per unit step criterion. With this new value of $h$, the
test using (6.6.3) is again repeated, as a safety check.

By choosing $h$ so that the truncation error will satisfy the bound (6.6.1) when
it is doubled or halved, we ensure that the stepsize will not have to be changed
for several steps, provided the derivative $Y^{(3)}(x)$ is not changing rapidly.
Changing the stepsize will be more expensive than a normal step, and we want to
minimize the need for such changes.
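The initial stepsize test can be sketched directly from (6.6.3), (6.6.4), and
(6.6.5). The code below is a hedged illustration: the helper names are
hypothetical, the lower bound in the acceptance test follows (6.6.1), and the
sample problem $y' = 1/(1+x^2) - 2y^2$, $y(0) = 0$ with $h = .1$ and
$\epsilon = .0005$ is borrowed from the example later in this section.

```python
import math


def pc_step_euler(f, x, y, h, iterations=2):
    """Euler predictor (6.5.12) followed by `iterations` passes of (6.5.3)."""
    fx = f(x, y)
    y_next = y + h * fx
    for _ in range(iterations):
        y_next = y + 0.5 * h * (fx + f(x + h, y_next))
    return y_next


def initial_step_trial(f, x0, y0, h, eps):
    """Return (trunc, accepted, h_new) for one trial of the initial stepsize."""
    y_h = pc_step_euler(f, x0, y0, h)
    y_half = pc_step_euler(f, x0, y0, h / 2)
    y_h2 = pc_step_euler(f, x0 + h / 2, y_half, h / 2)
    trunc = 4.0 / 3.0 * (y_h2 - y_h)                      # estimate (6.6.3)
    accepted = 0.25 * eps * h <= abs(trunc) <= eps * h    # test (6.6.1)
    f0, f1, f2 = f(x0, y0), f(x0 + h / 2, y_half), f(x0 + h, y_h2)
    d3y = (f2 - 2.0 * f1 + f0) / (h * h / 4.0)            # difference (6.6.4)
    h_new = math.sqrt(6.0 * eps / abs(d3y)) if d3y != 0.0 else None
    return trunc, accepted, h_new


f = lambda x, y: 1.0 / (1.0 + x * x) - 2.0 * y * y
trunc, accepted, h_new = initial_step_trial(f, 0.0, 0.0, 0.1, 0.0005)
print(trunc, accepted, h_new)
```

With these inputs the trial stepsize is rejected and the proposed replacement is
close to the first stepsize reported in Table 6.9.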


The regular predictor-corrector step  The stepsize $h$ satisfies $x_n - x_{n-1} =
x_{n+1} - x_n = h$. To solve for the value $y_{n+1}$, use the midpoint predictor (6.5.15)
and iterate once in (6.5.3). The local error (6.5.11) is estimated using (6.5.17):

$$\begin{aligned}
-\frac{1}{6}\left(y_{n+1} - y_{n+1}^{(0)}\right) &= -\frac{h^3}{12} u_n^{(3)}(x_n) + O(h^4) \\
&= \left[u_n(x_{n+1}) - y_{n+1}\right] + O(h^4)
\end{aligned} \tag{6.6.6}$$

Thus we measure the local error using

$$\text{trunc} = -\frac{1}{6}\left(y_{n+1} - y_{n+1}^{(0)}\right) \tag{6.6.7}$$

If trunc satisfies (6.6.1), then the value of $h$ is not changed and calculation
continues with this regular step procedure. But when (6.6.1) is not satisfied, the
values $y_{n+1}$ and $x_{n+1}$ are discarded, and a new stepsize is chosen based on the
value of trunc.


Changing the stepsize  Using (6.6.6),

$$\frac{u_n^{(3)}(x_n)}{12} \approx -\frac{\text{trunc}}{h_0^3}$$

where $h_0$ denotes the stepsize used in obtaining trunc. For an arbitrary stepsize
$h$, the local error in obtaining $y_{n+1}$ is estimated using

$$u_n(x_n + h) - y_h(x_{n+1}) \approx -\frac{h^3}{12} u_n^{(3)}(x_n) \approx \frac{h^3}{h_0^3}\,\text{trunc}$$

Choose $h$ so that

$$\frac{h^3}{h_0^3}|\text{trunc}| = \frac{1}{2}\epsilon h, \qquad h = \sqrt{\frac{h_0^3\,\epsilon}{2\,|\text{trunc}|}} \tag{6.6.8}$$

Calculate $y_{n+1}$ by using the Euler predictor and iterating twice in (6.5.3). Then
return to the regular predictor-corrector step. To avoid rapid changes in $h$ that
can lead to significant errors, the new value of $h$ is never allowed to be more than
twice the previous value. If the new value of $h$ is less than $h_{\min}$, then calculation
is terminated. But if the new value of $h$ is greater than $h_{\max}$, we just let $h = h_{\max}$
and proceed with the calculation. This has possible problems, which are discussed
following the numerical example.
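The stepsize update just described fits in one small function. This is a sketch
under stated assumptions: the name `new_stepsize`, the returned error flag, and
the handling of an exactly zero trunc (simply doubling $h_0$) are choices made
here for illustration.

```python
import math


def new_stepsize(h0, trunc, eps, h_min, h_max):
    """Return (h, flag): h from (6.6.8), limited to 2*h0 and h_max.

    flag is 0 normally, 1 if h was cut back to h_max, and 2 (with h = None)
    if the proposed stepsize is below h_min.
    """
    if trunc == 0.0:
        h = 2.0 * h0                       # assumption: no error information
    else:
        h = math.sqrt(h0 ** 3 * eps / (2.0 * abs(trunc)))
    h = min(h, 2.0 * h0)                   # never more than double
    if h < h_min:
        return None, 2
    if h > h_max:
        return h_max, 1
    return h, 0


# A local error twice the upper limit eps*h0 halves the stepsize.
h_small, flag_small = new_stepsize(0.1, 2 * 0.0005 * 0.1, 0.0005, 0.001, 1.0)
# A negligible local error would suggest a huge h; it is limited to 2*h0.
h_big, flag_big = new_stepsize(0.1, 1e-12, 0.0005, 0.001, 1.0)
print(h_small, h_big)
```

In the first call the predicted local error at the new stepsize is
$(h/h_0)^3|\text{trunc}| = \tfrac{1}{2}\epsilon h$, the middle of the range (6.6.1).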


Algorithm  Detrap($f$, $x_0$, $y_0$, $x_{end}$, $\epsilon$, $h$, $h_{\min}$, $h_{\max}$, ier)

1.  Remark: The problem being solved is $Y' = f(x, Y)$, $Y(x_0) = y_0$,
    for $x_0 \le x \le x_{end}$, using the method described earlier in the
    section. The approximate solution values are printed at each
    node point. The error parameter $\epsilon$ and the stepsize parameters
    were discussed earlier in the section. The variable ier is an
    error indicator, output when exiting the algorithm: ier = 0
    means a normal return; ier = 1 means that $h = h_{\max}$ at some
    node points; and ier = 2 means that the integration was
    terminated due to a necessary $h < h_{\min}$.

2.  Initialize: loop := 1, ier := 0.

3.  Remark: Choose an initial value of $h$.

4.  Calculate $y_h(x_0 + h)$, $y_{h/2}(x_0 + (h/2))$, $y_{h/2}(x_0 + h)$ using
    method (6.5.2). In each case, use the Euler predictor (6.5.12)
    and follow it by two iterations of (6.5.3).

5.  For the error in $y_h(x_0 + h)$, use
    trunc := $\frac{4}{3}\left[y_{h/2}(x_0 + h) - y_h(x_0 + h)\right]$

6.  If $\frac{1}{4}\epsilon h \le |\text{trunc}| \le \epsilon h$, or if loop = 2, then $x_1 := x_0 + h$,
    $y_1 := y_h(x_0 + h)$, print $x_1$, $y_1$, and go to step 10.

7.  Calculate $D_3 y \approx Y^{(3)}(x_0)$ from (6.6.4). If $D_3 y \ne 0$, then
    $h := \sqrt{6\epsilon / |D_3 y|}$.
    If $D_3 y = 0$, then $h := h_{\max}$ and loop := 2.

8.  If $h < h_{\min}$, then ier := 2 and exit. If $h > h_{\max}$, then $h := h_{\max}$,
    ier := 1, loop := 2.

9.  Go to step 4.

10. Remark: This portion of the algorithm contains the regular
    predictor-corrector step with error control.

11. Let $x_2 := x_1 + h$, and $y_2^{(0)} := y_0 + 2h f(x_1, y_1)$. Iterate (6.5.3)
    once to obtain $y_2$.

12. trunc := $-\frac{1}{6}\left(y_2 - y_2^{(0)}\right)$

13. If $|\text{trunc}| > \epsilon h$ or $|\text{trunc}| < \frac{1}{4}\epsilon h$, then go to step 16.

14. Print $x_2$, $y_2$.

15. $x_0 := x_1$, $x_1 := x_2$, $y_0 := y_1$, $y_1 := y_2$. If $x_1 < x_{end}$, then go to
    step 11. Otherwise exit.

16. Remark: Change the stepsize.

17. $x_0 := x_1$, $y_0 := y_1$, $h_0 := h$, and calculate $h$ using (6.6.8).

18. $h := \text{Min}\{h, 2h_0\}$.

19. If $h < h_{\min}$, then ier := 2 and exit. If $h > h_{\max}$, then ier := 1
    and $h := h_{\max}$.

20. $y_1^{(0)} := y_0 + h f(x_0, y_0)$, and iterate twice in (6.5.3) to calculate
    $y_1$. Also, $x_1 := x_0 + h$.

21. Print $x_1$, $y_1$.

22. If $x_1 < x_{end}$, then go to step 10. Otherwise, exit.
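The steps of Detrap translate fairly directly into code. The following Python
version is only a sketch, not the implementation used for the tables: it
collects the nodes in a list instead of printing them, assumes the lower limit
$\frac{1}{4}\epsilon h$ in the test (6.6.1), doubles $h_0$ when trunc is exactly
zero, and adds a `max_retries` safeguard (an assumption, not in the algorithm)
so that the initial stepsize search always terminates.

```python
import math


def detrap(f, x0, y0, x_end, eps, h, h_min, h_max, max_retries=20):
    """Sketch of Detrap. Returns (nodes, ier); nodes holds (x, y, h) triples."""
    def corrector(x, y, step, guess, iterations):
        fx = f(x, y)
        for _ in range(iterations):                       # iteration (6.5.3)
            guess = y + 0.5 * step * (fx + f(x + step, guess))
        return guess

    def euler_pc(x, y, step):                             # (6.5.12), two iterates
        return corrector(x, y, step, y + step * f(x, y), 2)

    def in_range(trunc, step):                            # test (6.6.1)
        return 0.25 * eps * step <= abs(trunc) <= eps * step

    ier, loop, nodes = 0, 1, []
    for attempt in range(max_retries + 1):                # steps 4-9
        y_h = euler_pc(x0, y0, h)
        y_mid = euler_pc(x0, y0, h / 2)
        y_h2 = euler_pc(x0 + h / 2, y_mid, h / 2)
        trunc = 4.0 / 3.0 * (y_h2 - y_h)                  # (6.6.3)
        if in_range(trunc, h) or loop == 2 or attempt == max_retries:
            break
        d3y = (f(x0 + h, y_h2) - 2.0 * f(x0 + h / 2, y_mid)
               + f(x0, y0)) / (h * h / 4.0)               # (6.6.4)
        if d3y != 0.0:
            h = math.sqrt(6.0 * eps / abs(d3y))           # (6.6.5)
        else:
            h, loop = h_max, 2
        if h < h_min:
            return nodes, 2
        if h > h_max:
            h, ier, loop = h_max, 1, 2
    x1, y1 = x0 + h, y_h
    nodes.append((x1, y1, h))

    while x1 < x_end:
        y_pred = y0 + 2.0 * h * f(x1, y1)                 # predictor (6.5.15)
        y2 = corrector(x1, y1, h, y_pred, 1)
        trunc = -(y2 - y_pred) / 6.0                      # (6.6.7)
        if in_range(trunc, h):                            # steps 14-15: accept
            x0, y0, x1, y1 = x1, y1, x1 + h, y2
        else:                                             # steps 16-20: discard y2
            x0, y0, h0 = x1, y1, h
            if trunc == 0.0:
                h = 2.0 * h0                              # assumption, see lead-in
            else:
                h = math.sqrt(h0 ** 3 * eps / (2.0 * abs(trunc)))   # (6.6.8)
            h = min(h, 2.0 * h0)                          # step 18
            if h < h_min:
                return nodes, 2
            if h > h_max:
                h, ier = h_max, 1
            x1, y1 = x0 + h, euler_pc(x0, y0, h)
        nodes.append((x1, y1, h))
    return nodes, ier


def rhs(x, y):
    return 1.0 / (1.0 + x * x) - 2.0 * y * y     # test problem, Y(x) = x/(1+x^2)


nodes, ier = detrap(rhs, 0.0, 0.0, 10.0, 0.0005, 0.1, 0.001, 1.0)
max_error = max(abs(x / (1.0 + x * x) - y) for x, y, _ in nodes)
print(len(nodes), ier, nodes[0], max_error)
```

The usage lines apply the sketch to the test problem of the example that
follows, with the same parameters; the node values need not agree digit for
digit with the printed table, since implementation details may differ.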


The following example uses an implementation of Detrap that also prints trunc. 
A section of code was added to predict the truncation error in y, of step 20. 


Example  Consider the problem

$$y' = \frac{1}{1 + x^2} - 2y^2, \qquad y(0) = 0 \tag{6.6.9}$$

which has the solution

$$Y(x) = \frac{x}{1 + x^2}$$


This is an interesting problem for testing Detrap, and it performs quite well. The
equation was solved on $[0, 10]$ with $h_{\min} = .001$, $h_{\max} = 1.0$, $h = .1$, and $\epsilon =
.0005$. Table 6.9 contains some of the results, including the true global error and
the true local error, labeled True le. The latter was obtained by using another
more accurate numerical method. Only selected sections of output are shown
because of space.

Table 6.9  Example of algorithm Detrap

  x_n       h        y_n        Y(x_n) - y_n   trunc        True le
  .0227    .0227    .022689     5.84E-6        5.84E-6      5.84E-6
  .0454    .0227    .045308     1.17E-5        5.83E-6      5.84E-6
  .0681    .0227    .067787     1.74E-5        5.76E-6      5.75E-6
  .0908    .0227    .090060     2.28E-5        5.62E-6      5.61E-6
  .2725    .0227    .253594     5.16E-5        2.96E-6      2.85E-6
  .3065    .0340    .280125     5.66E-5        6.74E-6      6.79E-6
  .3405    .0340    .305084     6.01E-5        6.21E-6      5.73E-6
  .3746    .0340    .328411     6.11E-5        4.28E-6      3.54E-6
  .4408    .0662    .369019     5.05E-5       -6.56E-6     -5.20E-6
  .5070    .0662    .403297     2.44E-5       -1.04E-5     -2.12E-5
  .5732    .0662    .431469    -2.03E-5       -2.92E-5     -4.21E-5
  .6138    .0406    .445879    -2.99E-5       -1.12E-5     -1.10E-5

 1.9595    .135     .404982    -1.02E-4       -1.64E-5     -1.67E-5
 2.0942    .135     .388944    -1.03E-4       -1.79E-5     -2.11E-5
 2.3172    .223     .363864    -6.57E-5        1.27E-5      8.15E-6
 2.7632    .446     .319649     3.44E-4        4.41E-4      3.78E-4
 3.0664    .303     .294447     3.21E-4        9.39E-5      8.41E-5
 7.6959    .672     .127396     3.87E-4        8.77E-5      1.12E-4
 8.6959   1.000     .113100     3.96E-4        1.73E-4      1.57E-4
 9.6959   1.000     .101625     4.27E-4        1.18E-4      1.68E-4
10.6959   1.000     .092273     4.11E-4        9.45E-5      1.21E-4


We illustrate several points using the example. First, step 18 is necessary in
order to avoid stepsizes that are far too large. For the problem (6.6.9), we have

$$Y^{(3)}(x) = \frac{-6(x^4 - 6x^2 + 1)}{(1 + x^2)^4}$$

which is zero at $x = \pm .414, \pm 2.414$. Thus the local error in solving the problem
(6.6.9) will be very small near these points, based on (6.6.2) and the close relation
of $u_n(x)$ to $Y(x)$. This leads to a prediction of a very large $h$, in (6.6.8), one
which will be too large for following points $x_n$. At $x_n = 2.7632$, step 18 was
needed to avoid a misleadingly large value of $h$. As can be observed, the local
error at $x_n = 2.7632$ increases greatly, due to the larger value of $h$. Shortly
thereafter, the stepsize $h$ is decreased to reduce the size of the local error.
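The quoted zeros of $Y^{(3)}$ follow from treating $x^4 - 6x^2 + 1 = 0$ as a
quadratic in $x^2$, whose roots are $3 \mp 2\sqrt{2}$. A short check, with the
function name chosen here only for illustration:

```python
import math


def third_derivative(x):
    """Y'''(x) for Y(x) = x / (1 + x^2), using the formula quoted in the text."""
    return -6.0 * (x ** 4 - 6.0 * x ** 2 + 1.0) / (1.0 + x ** 2) ** 4


# x^2 = 3 - 2*sqrt(2) and x^2 = 3 + 2*sqrt(2) give the positive zeros.
roots = [math.sqrt(3.0 - 2.0 * math.sqrt(2.0)),
         math.sqrt(3.0 + 2.0 * math.sqrt(2.0))]
print(roots, [third_derivative(r) for r in roots])
```

The two positive roots are $\sqrt{2} - 1$ and $\sqrt{2} + 1$, which agree with
$.414$ and $2.414$ to the digits shown.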

In all of the derivations of this and the preceding section, estimates were made
that were accurate if $h$ was sufficiently small. In most cases, the crucial quantity
is actually $h f_y(x_n, y_n)$, as in (6.5.7) when analyzing the rate of convergence of the
iteration (6.5.3). In the case of the trapezoidal iteration of (6.5.3), this rate of
convergence is

$$\text{Rate} = \frac{1}{2} h f_y(x_n, y_n) \tag{6.6.10}$$

and for the problem (6.6.9), the rate is

$$\text{Rate} = -2h y_n$$

If this is near 1, then several iterations are necessary to obtain an accurate
estimate of $y_{n+1}$. From the table, the rate roughly increases in size as $h$ increases.
At $x_n = 2.3172$, this rate is about .162. This still seems small enough, but the
local error is more inaccurate than previously, and this may be due to a less
accurate iterate being obtained in (6.5.3). The algorithm can be made more
sophisticated in order to detect the problems of too large an $h$, but setting a
reasonably sized $h_{\max}$ will also help.
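The rate (6.6.10) is simple to tabulate from the printed output. The sketch
below evaluates $-2hy_n$ for a few rows copied from Table 6.9; the choice of
rows and the helper name are illustrative.

```python
def iteration_rate(h, y_n):
    """Rate (6.6.10) for problem (6.6.9), where f_y(x, y) = -4y."""
    return 0.5 * h * (-4.0 * y_n)


# (x_n, h, y_n) copied from selected rows of Table 6.9.
rows = [(0.0227, 0.0227, 0.022689), (0.6138, 0.0406, 0.445879),
        (2.3172, 0.223, 0.363864), (2.7632, 0.446, 0.319649),
        (8.6959, 1.000, 0.113100)]
rates = {x: iteration_rate(h, y) for x, h, y in rows}
for x, rate in rates.items():
    print(f"x_n = {x:7.4f}   rate = {rate:8.5f}")
```

The magnitudes grow from about .001 on the first step to roughly .2 or .3 on
the larger steps, which is the trend described above.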


The global error  We begin by giving an error bound analogous to the bound
(6.5.18) for a fixed stepsize. Write

$$Y(x_{n+1}) - y_{n+1} = \left[Y(x_{n+1}) - u_n(x_{n+1})\right] + \left[u_n(x_{n+1}) - y_{n+1}\right] \tag{6.6.11}$$

For the last term, we assume the error per unit step criterion (6.6.1) is satisfied:

$$|u_n(x_{n+1}) - y_{n+1}| \le \epsilon(x_{n+1} - x_n) \tag{6.6.12}$$

For the other term in (6.6.11), introduce the integral equation reformulations

$$Y(x) = Y(x_n) + \int_{x_n}^{x} f(t, Y(t))\,dt$$

$$u_n(x) = y_n + \int_{x_n}^{x} f(t, u_n(t))\,dt, \qquad x \ge x_n \tag{6.6.13}$$

Subtract and take bounds using the Lipschitz condition to obtain

$$|Y(x) - u_n(x)| \le |e_n| + K(x - x_n)\max_{x_n \le t \le x}|Y(t) - u_n(t)|, \qquad x \ge x_n$$

with $e_n = Y(x_n) - y_n$. Using this, we can derive

$$|Y(x_{n+1}) - u_n(x_{n+1})| \le \frac{|e_n|}{1 - K(x_{n+1} - x_n)} \tag{6.6.14}$$

Introduce

$$H = \text{Max}\,(x_{n+1} - x_n) = \text{Max}\; h$$

and assume that

$$HK < 1$$


Combining (6.6.11), (6.6.12), and (6.6.14), we obtain

$$|e_{n+1}| \le \frac{1}{1 - HK}|e_n| + \epsilon H, \qquad x_0 \le x_n \le b$$

This is easily solved, much as in Theorem 6.3, obtaining

$$|Y(x_n) - y_n| \le e^{c(b - x_0)}|Y(x_0) - y_0| + \left[\frac{e^{c(b - x_0)} - 1}{c}\right]\epsilon \tag{6.6.15}$$

for an appropriate constant $c > 0$. This is the basic error result when using a
variable stepsize, and it is a partial justification of the error criterion (6.6.1).

In some situations, we can obtain a more realistic bound. For simplicity,
assume $f_y(x, y) \le 0$ for all $(x, y)$. Subtracting in (6.6.13),

$$\begin{aligned}
Y(x) - u_n(x) &= e_n + \int_{x_n}^{x}\left[f(t, Y(t)) - f(t, u_n(t))\right]dt \\
&= e_n + \int_{x_n}^{x}\frac{\partial f(t, \zeta(t))}{\partial y}\left[Y(t) - u_n(t)\right]dt
\end{aligned}$$

The last step uses the mean-value theorem 1.2, and it can be shown that
$\partial f(t, \zeta(t))/\partial y$ is a continuous function of $t$. This shows that $v(x) = Y(x) -
u_n(x)$ is a solution of the linear problem

$$v'(x) = \frac{\partial f(x, \zeta(x))}{\partial y}\,v(x), \qquad v(x_n) = e_n$$

The solution of this linear problem, along with the assumption $f_y(x, y) \le 0$,
implies

$$|Y(x_{n+1}) - u_n(x_{n+1})| \le |e_n|$$


The condition $f_y(x, y) \le 0$ is associated with well-conditioned initial value
problems, as was noted earlier in (6.1.8).
Combining with (6.6.11) and (6.6.12), we obtain

$$|Y(x_{n+1}) - y_{n+1}| \le |Y(x_n) - y_n| + \epsilon(x_{n+1} - x_n)$$

Solving the inequality, we obtain the more realistic bound

$$|Y(x_n) - y_n| \le |Y(x_0) - y_0| + \epsilon(x_n - x_0) \tag{6.6.16}$$

This partially explains the good behavior of the example in Table 6.9; and even
better theoretical results are possible. But results (6.6.15) and (6.6.16) are
sufficient justification for the use of the test (6.6.1), which controls the error per unit
step. For systems of equations $\mathbf{y}' = \mathbf{f}(x, \mathbf{y})$, the condition $f_y(x, y) \le 0$ is replaced
by requiring that all eigenvalues of the Jacobian matrix $\mathbf{f}_{\mathbf{y}}(x, \mathbf{Y}(x))$ have real
parts that are zero or negative.

The algorithm Detrap could be improved in a number of ways. But it
illustrates the construction of a predictor-corrector algorithm with variable
stepsize. The output is printed at an inconvenient set of node points $x_n$, but a
simple interpolation algorithm can take care of this. The predictors can be
improved upon, but that too would make the algorithm more complicated. In the
next section, we return to a discussion of currently available practical
predictor-corrector algorithms, most of which also vary the order.


6.7 Derivation of Higher Order Multistep Methods


Recall from (6.3.1) of Section 6.3 the general formula for a $p + 1$ step method
for solving the initial value problem (6.0.1):

$$y_{n+1} = \sum_{j=0}^{p} a_j y_{n-j} + h\sum_{j=-1}^{p} b_j f(x_{n-j}, y_{n-j}), \qquad n \ge p \tag{6.7.1}$$

A theory was given for these methods in Section 6.3. Some specific higher order
methods will now be derived. There are two principal means of deriving higher
order formulas: (1) the method of undetermined coefficients, and (2) numerical
integration. The methods based on numerical integration are currently the most
popular, but the perspective of the method of undetermined coefficients is still
important in analyzing and developing numerical methods.

The implicit formulas can be solved by iteration, in complete analogy with
(6.5.2) and (6.5.3) for the trapezoidal method. If $b_{-1} \ne 0$ in (6.7.1), the iteration
is defined by

$$y_{n+1}^{(l+1)} = \sum_{j=0}^{p}\left[a_j y_{n-j} + h b_j f(x_{n-j}, y_{n-j})\right] + h b_{-1} f(x_{n+1}, y_{n+1}^{(l)}), \qquad l = 0, 1, \dots \tag{6.7.2}$$

The iteration converges if $h b_{-1} K < 1$, where $K$ is the Lipschitz constant for
$f(x, y)$, contained in (6.2.12). The linear rate of convergence will be bounded by
$h b_{-1} K$, in analogy with (6.5.5) for the trapezoidal method.

We look for pairs of formulas, a corrector and a predictor. Suppose that the
corrector formula has order $m$, that is, the truncation error is $O(h^{m+1})$ at each
step. Often only one iterate is computed in (6.7.2), and this means that the
predictor must have order at least $m - 1$ in order that the truncation error in
$y_{n+1}^{(1)}$ is also $O(h^{m+1})$. See the discussion of the trapezoidal method iteration error
in Section 6.5, between (6.5.3) and (6.5.17), using the Euler and midpoint
predictors. The essential ideas transfer to (6.7.1) and (6.7.2) without any
significant change.


The method of undetermined coefficients  If formula (6.7.1) is to have order
$m \ge 1$, then from Theorem 6.5 it is necessary and sufficient that

$$\sum_{j=0}^{p} a_j = 1$$

$$\sum_{j=0}^{p} a_j(-j)^i + i\sum_{j=-1}^{p} b_j(-j)^{i-1} = 1, \qquad i = 1, 2, \dots, m \tag{6.7.3}$$

For an explicit method, there is the additional condition that $b_{-1} = 0$.

For a general implicit method, there are $2p + 3$ parameters $\{a_j, b_j\}$ to be
determined, and there are $m + 1$ equations. It might be thought that we could
take $m + 1 = 2p + 3$, but this would be extremely unwise from the viewpoint of
the stability and convergence of (6.7.1). This point is illustrated in the next
section and in the problems. Generally, it is best to let $m \le p + 2$ for an implicit
method. For an explicit method, stability considerations are often not as
important because the method is usually a predictor for an implicit formula.
Example  Find all second-order two-step methods. The formula (6.7.1) is

$$y_{n+1} = a_0 y_n + a_1 y_{n-1} + h\left[b_{-1} f(x_{n+1}, y_{n+1}) + b_0 f(x_n, y_n) + b_1 f(x_{n-1}, y_{n-1})\right], \qquad n \ge 1 \tag{6.7.4}$$

The coefficients must satisfy (6.7.3) with $m = 2$:

$$a_0 + a_1 = 1, \qquad -a_1 + b_{-1} + b_0 + b_1 = 1, \qquad a_1 + 2b_{-1} - 2b_1 = 1$$

Solving,

$$a_1 = 1 - a_0, \qquad b_{-1} = 1 - \tfrac{1}{4}a_0 - \tfrac{1}{2}b_0, \qquad b_1 = 1 - \tfrac{3}{4}a_0 - \tfrac{1}{2}b_0 \tag{6.7.5}$$

with $a_0$, $b_0$ variable. The midpoint method is a special case, in which $a_0 = 0$,
$b_0 = 2$. The coefficients $a_0$, $b_0$ can be chosen to improve the stability, give a small
truncation error, give an explicit formula, or some combination of these. The
conditions to ensure stability and convergence (other than $0 \le a_0 \le 1$ and using
Theorem 6.6) cannot be given until the general theory for (6.7.1) has been given
in the next section.
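The family (6.7.5) can be checked against the order conditions (6.7.3) in exact
rational arithmetic. The sketch below is illustrative; the function names are
hypothetical, and the particular parameter values tried are arbitrary apart
from those that recover the midpoint, trapezoidal, and Simpson formulas.

```python
from fractions import Fraction as F


def two_step_coefficients(a0, b0):
    """Coefficients of (6.7.4) from the free parameters a0, b0, using (6.7.5)."""
    a = {0: a0, 1: 1 - a0}
    b = {-1: 1 - a0 / 4 - b0 / 2, 0: b0, 1: 1 - 3 * a0 / 4 - b0 / 2}
    return a, b


def order_residuals(a, b, m):
    """Left side minus right side of the conditions (6.7.3), for i = 0, ..., m."""
    residuals = [sum(a.values()) - 1]
    for i in range(1, m + 1):
        lhs = sum(aj * F(-j) ** i for j, aj in a.items())
        lhs += i * sum(bj * F(-j) ** (i - 1) for j, bj in b.items())
        residuals.append(lhs - 1)
    return residuals


generic = order_residuals(*two_step_coefficients(F(1, 3), F(5, 7)), 2)
midpoint = two_step_coefficients(F(0), F(2))
trapezoid = two_step_coefficients(F(1), F(1, 2))
simpson = order_residuals(*two_step_coefficients(F(0), F(4, 3)), 4)
print(generic, midpoint, trapezoid, simpson)
```

Every member of the family satisfies the conditions for $m = 2$, and the choice
$a_0 = 0$, $b_0 = 4/3$ satisfies them up to $m = 4$; that choice is the Simpson
formula derived later in this section.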


With (6.7.3) satisfied, Theorem 6.5 implies that the truncation error satisfies

$$T_n(Y) = O(h^{m+1})$$

for all $Y(x)$ that are $m + 1$ times continuously differentiable on $[x_0, b]$. This is
sufficient for most practical and theoretical purposes. But sometimes in
constructing predictor-corrector algorithms, it is preferable to have an error of the form

$$T_n(Y) = d_m h^{m+1} Y^{(m+1)}(\xi_n), \qquad x_{n-p} \le \xi_n \le x_{n+1} \tag{6.7.6}$$

for some constant $d_m$, independent of $n$. Examples are Euler's method, the
midpoint method, and the trapezoidal method.

To obtain a formula (6.7.6), assuming (6.7.3) has been satisfied, we first
express the truncation error $T_n(Y)$ as an integral, using the concept of the Peano
kernel introduced in the last part of Section 5.1. Expand $Y(x)$ about $x_{n-p}$:

$$Y(x) = \sum_{i=0}^{m}\frac{(x - x_{n-p})^i}{i!}\,Y^{(i)}(x_{n-p}) + R_{m+1}(x)$$

$$R_{m+1}(x) = \frac{1}{m!}\int_{x_{n-p}}^{x_{n+1}}(x - t)_+^m\,Y^{(m+1)}(t)\,dt$$

$$(x - t)_+^m = \begin{cases}(x - t)^m & x \ge t \\ 0 & x < t\end{cases}$$

Substitute the expansion into the formula (6.3.4) for $T_n(Y)$. Use the assumption
that the method is of order $m$ to obtain

$$T_n(Y) = T_n(R_{m+1}) = R_{m+1}(x_{n+1}) - \left[\sum_{j=0}^{p} a_j R_{m+1}(x_{n-j}) + h\sum_{j=-1}^{p} b_j R_{m+1}'(x_{n-j})\right]$$

$$T_n(Y) = \int_{x_{n-p}}^{x_{n+1}} G(t - x_{n-p})\,Y^{(m+1)}(t)\,dt \tag{6.7.7}$$

with the Peano kernel

$$G(s) = \frac{1}{m!}\left\{(x_{p+1} - s)_+^m - \sum_{j=0}^{p} a_j(x_{p-j} - s)_+^m - hm\sum_{j=-1}^{p} b_j(x_{p-j} - s)_+^{m-1}\right\}, \qquad 0 \le s \le x_{p+1} \tag{6.7.8}$$

Here the nodes in $G(s)$ are taken as $x_j = jh$, $j = 0, 1, \dots, p + 1$, as in the example
below. The function $G(s)$ is also often called an influence function.
The $m + 1$ conditions (6.7.3) on the coefficients in (6.7.1) mean there are
$r = (2p + 3) - (m + 1)$ free parameters [$r = (2p + 3) - (m + 2)$ for an explicit
formula]. Thus there are $r$ free parameters in determining $G(s)$. We determine
those values of the parameters for which $G(s)$ is of one sign on $[0, x_{p+1}]$. And
then by the integral mean value theorem, we will have

$$T_{n+1}(Y) = Y^{(m+1)}(\xi_n)\int_{x_{n-p}}^{x_{n+1}} G(t - x_{n-p})\,dt$$

for some $x_{n-p} \le \xi_n \le x_{n+1}$. By further manipulation, we obtain (6.7.6) with

$$d_m = \frac{1}{m!}\int_0^{p+1}\left\{(p + 1 - \nu)^m - \sum_{j=0}^{p} a_j(p - j - \nu)_+^m - m\sum_{j=-1}^{p} b_j(p - j - \nu)_+^{m-1}\right\}d\nu \tag{6.7.9}$$

Again, this is dependent on $G(s)$ being of one sign for $0 \le s \le x_{p+1}$.


Example  Consider formula (6.7.4), and assume the formula is explicit [$b_{-1} = 0$].
Then

$$y_{n+1} = a_0 y_n + a_1 y_{n-1} + h\left[b_0 f(x_n, y_n) + b_1 f(x_{n-1}, y_{n-1})\right], \qquad n \ge 1 \tag{6.7.10}$$

with

$$a_1 = 1 - a_0, \qquad b_0 = 2 - \tfrac{1}{2}a_0, \qquad b_1 = -\tfrac{1}{2}a_0$$

Using the preceding formulas,

$$\begin{aligned}
G(s) &= \tfrac{1}{2}\left[(x_2 - s)_+^2 - a_0(x_1 - s)_+^2 - a_1(x_0 - s)_+^2 - 2hb_0(x_1 - s)_+ - 2hb_1(x_0 - s)_+\right] \\
&= \begin{cases}\tfrac{1}{2}s\left[s(1 - a_0) + a_0 h\right] & 0 \le s \le h \\ \tfrac{1}{2}(x_2 - s)^2 & h \le s \le 2h\end{cases}
\end{aligned}$$

The condition that $G(s)$ be of one sign on $[0, 2h]$ is satisfied if and only if
$a_0 \ge 0$. Then

$$T_{n+1}(Y) = \left(\tfrac{1}{12}a_0 + \tfrac{1}{3}\right)h^3\,Y^{(3)}(\xi_n), \qquad x_{n-1} \le \xi_n \le x_{n+1} \tag{6.7.11}$$

Note that the truncation error is a minimum when $a_0 = 0$; thus the midpoint
method is optimal in having a small truncation error, among all such explicit
two-step second-order methods. For a further discussion of this example, see
Problem 31.
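The kernel in this example can be evaluated numerically to confirm both the
sign condition and the constant in (6.7.11). The sketch below takes $x_0 = 0$,
$x_1 = h$, $x_2 = 2h$ and integrates $G$ with the composite Simpson rule, which
is exact here because $G$ is piecewise quadratic with its break at a panel
boundary. The function names and the sample values of $a_0$ and $h$ are
assumptions made for illustration.

```python
def plus(v):
    """Truncated value: v if v > 0, else 0."""
    return v if v > 0.0 else 0.0


def peano_kernel(s, a0, h):
    """G(s) for the explicit two-step family (6.7.10), with m = 2 and p = 1."""
    a1, b0, b1 = 1.0 - a0, 2.0 - 0.5 * a0, -0.5 * a0
    x0, x1, x2 = 0.0, h, 2.0 * h
    return 0.5 * (plus(x2 - s) ** 2
                  - a0 * plus(x1 - s) ** 2 - a1 * plus(x0 - s) ** 2
                  - 2.0 * h * b0 * plus(x1 - s) - 2.0 * h * b1 * plus(x0 - s))


def kernel_integral(a0, h, panels=2000):
    """Composite Simpson approximation of the integral of G over [0, 2h]."""
    width = 2.0 * h / panels
    total = peano_kernel(0.0, a0, h) + peano_kernel(2.0 * h, a0, h)
    for i in range(1, panels):
        total += (4.0 if i % 2 else 2.0) * peano_kernel(i * width, a0, h)
    return total * width / 3.0


def one_signed(a0, h, samples=400):
    """Check on a grid whether G keeps one sign, up to rounding error."""
    tol = 1e-12 * h * h
    values = [peano_kernel(2.0 * h * i / samples, a0, h) for i in range(samples + 1)]
    return min(values) >= -tol or max(values) <= tol


h, a0 = 0.1, 0.5
integral = kernel_integral(a0, h)
expected = (a0 / 12.0 + 1.0 / 3.0) * h ** 3
print(integral, expected, one_signed(a0, h), one_signed(-0.5, h))
```

For $a_0 = 0.5$ the kernel is nonnegative and its integral matches
$(\tfrac{1}{12}a_0 + \tfrac{1}{3})h^3$; for $a_0 = -0.5$ the kernel changes sign
near $s = 0$.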


Methods based on numerical integration  The general idea is the following.
Reformulate the differential equation by integrating it over some interval
$[x_{n-r}, x_{n+1}]$ to obtain

$$Y(x_{n+1}) = Y(x_{n-r}) + \int_{x_{n-r}}^{x_{n+1}} f(t, Y(t))\,dt \tag{6.7.12}$$

for some $r \ge 0$, all $n \ge r$. Produce a polynomial $P(t)$ that interpolates the
integrand $Y'(t) = f(t, Y(t))$ at some set of node points $\{x_i\}$, and then integrate
$P(t)$ over $[x_{n-r}, x_{n+1}]$ to approximate (6.7.12). The three previous methods,
Euler, midpoint, and trapezoidal, can all be obtained in this way.


Example  Apply Simpson's integration rule, (5.1.13) and (5.1.15), to the
equation

$$Y(x_{n+1}) = Y(x_{n-1}) + \int_{x_{n-1}}^{x_{n+1}} f(t, Y(t))\,dt$$

This results in

$$Y_{n+1} = Y_{n-1} + \frac{h}{3}\left[Y_{n-1}' + 4Y_n' + Y_{n+1}'\right] - \frac{h^5}{90}\,Y^{(5)}(\xi_n)$$

for some $x_{n-1} \le \xi_n \le x_{n+1}$. This approximation is based on integrating the
quadratic polynomial that interpolates $Y'(t) = f(t, Y(t))$ on the nodes $x_{n-1}$, $x_n$,
$x_{n+1}$. For simplicity, we use the notation $Y_j = Y(x_j)$, and $Y_j' = Y'(x_j)$.
Dropping the error term results in the implicit fourth-order formula

$$y_{n+1} = y_{n-1} + \frac{h}{3}\left[f(x_{n-1}, y_{n-1}) + 4f(x_n, y_n) + f(x_{n+1}, y_{n+1})\right], \qquad n \ge 1 \tag{6.7.13}$$

This is a well-known formula, and it is the corrector of a classical
predictor-corrector algorithm known as Milne's method. From Theorem 6.6, the method
converges and

$$\max_{x_0 \le x_n \le b}|Y(x_n) - y_n| = O(h^4)$$

But the method is only weakly stable, in the same manner as was true of the
midpoint method in Section 6.4.


The Adams methods  These are the most widely used multistep methods. They
are used to produce predictor-corrector algorithms in which the error is
controlled by varying both the stepsize $h$ and the order of the method. This is
discussed in greater detail later in the section, and a numerical example is given.
To derive the methods, use the integral reformulation

$$Y(x_{n+1}) = Y(x_n) + \int_{x_n}^{x_{n+1}} f(t, Y(t))\,dt \tag{6.7.14}$$

Polynomials that interpolate $Y'(t) = f(t, Y(t))$ are constructed, and then they are
integrated over $[x_n, x_{n+1}]$ to obtain an approximation to $Y_{n+1}$. We begin with the
explicit or predictor formulas.


Case 1. Adams-Bashforth Methods  Let $P_p(t)$ denote the polynomial of
degree $\le p$ that interpolates $Y'(t)$ at $x_{n-p}, \dots, x_n$. The most convenient form
for $P_p(t)$ will be the Newton backward difference formula (3.3.11), expanded
about $x_n$:

$$P_p(t) = Y_n' + \frac{(t - x_n)}{h}\nabla Y_n' + \frac{(t - x_n)(t - x_{n-1})}{2!\,h^2}\nabla^2 Y_n' + \cdots + \frac{(t - x_n)\cdots(t - x_{n-p+1})}{p!\,h^p}\nabla^p Y_n' \tag{6.7.15}$$

As an illustration of the notation, $\nabla Y_n' = Y'(x_n) - Y'(x_{n-1})$, and the higher
order backward differences are defined accordingly [see (3.3.10)]. The
interpolation error is given by

$$\begin{aligned}
E_p(t) = Y'(t) - P_p(t) &= (t - x_{n-p})\cdots(t - x_n)\,Y'[x_{n-p}, \dots, x_n, t] \\
&= \frac{(t - x_{n-p})\cdots(t - x_n)}{(p + 1)!}\,Y^{(p+2)}(\zeta_t), \qquad x_{n-p} \le \zeta_t \le x_{n+1}
\end{aligned} \tag{6.7.16}$$

provided $x_{n-p} \le t \le x_{n+1}$ and $Y(t)$ is $p + 2$ times continuously differentiable.
See (3.2.11) and (3.1.8) of Chapter 3 for the justification of (6.7.16).

Table 6.10  Adams-Bashforth coefficients

  \gamma_0   \gamma_1   \gamma_2   \gamma_3   \gamma_4    \gamma_5
  1          1/2        5/12       3/8        251/720     95/288
The integral of $P_p(t)$ is given by

$$\begin{aligned}
\int_{x_n}^{x_{n+1}} P_p(t)\,dt &= hY_n' + \sum_{j=1}^{p}\frac{\nabla^j Y_n'}{j!\,h^j}\int_{x_n}^{x_{n+1}}(t - x_n)\cdots(t - x_{n-j+1})\,dt \\
&= h\sum_{j=0}^{p}\gamma_j\nabla^j Y_n'
\end{aligned} \tag{6.7.17}$$

The coefficients $\gamma_j$ are obtained by introducing the change of variable $s =
(t - x_n)/h$, which leads to

$$\gamma_j = \frac{1}{j!}\int_0^1 s(s + 1)\cdots(s + j - 1)\,ds, \qquad j \ge 1 \tag{6.7.18}$$

with $\gamma_0 = 1$. Table 6.10 contains the first few values of $\gamma_j$. Gear (1971, pp.
104-111) contains additional information, including a generating function for the
coefficients.
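The entries of Table 6.10 can be regenerated from (6.7.18) with exact rational
arithmetic, by building the polynomial $s(s+1)\cdots(s+j-1)$ one factor at a
time and integrating it term by term over $[0, 1]$. The function name below is
an illustrative choice.

```python
from fractions import Fraction


def adams_bashforth_gammas(count):
    """Return [gamma_0, ..., gamma_{count-1}] from (6.7.18), computed exactly."""
    gammas = [Fraction(1)]
    poly = [Fraction(1)]       # coefficients of s(s+1)...(s+j-1), lowest power first
    factorial = 1
    for j in range(1, count):
        shift = j - 1          # multiply the current polynomial by (s + shift)
        new = [Fraction(0)] * (len(poly) + 1)
        for k, c in enumerate(poly):
            new[k] += shift * c
            new[k + 1] += c
        poly = new
        factorial *= j
        integral = sum(c / (k + 1) for k, c in enumerate(poly))   # over [0, 1]
        gammas.append(integral / factorial)
    return gammas


gammas = adams_bashforth_gammas(6)
print([str(g) for g in gammas])
```

The six values produced are 1, 1/2, 5/12, 3/8, 251/720, and 95/288.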

The truncation error in using the interpolating polynomial $P_p(t)$ in (6.7.14) is
given by

$$T_n(Y) = \int_{x_n}^{x_{n+1}} E_p(t)\,dt = \int_{x_n}^{x_{n+1}}(t - x_{n-p})\cdots(t - x_n)\,Y'[x_{n-p}, \dots, x_n, t]\,dt$$

Assuming $Y(x)$ is $p + 2$ times continuously differentiable for $x_{n-p} \le x \le x_{n+1}$,
Theorem 3.3 of Chapter 3 implies that the divided difference in the last integral is
a continuous function of $t$. Since the polynomial $(t - x_{n-p})\cdots(t - x_n)$ is
nonnegative on $[x_n, x_{n+1}]$, use the integral mean value theorem (Theorem 1.3) to
obtain

$$T_n(Y) = Y'[x_{n-p}, \dots, x_n, \xi]\int_{x_n}^{x_{n+1}}(t - x_{n-p})\cdots(t - x_n)\,dt$$

for some $x_n \le \xi \le x_{n+1}$. Use (6.7.18) to calculate the integral, and use (3.2.12) to
convert the divided difference to a derivative. Then

$$T_n(Y) = \gamma_{p+1}h^{p+2}\,Y^{(p+2)}(\xi_n), \qquad x_{n-p} \le \xi_n \le x_{n+1} \tag{6.7.19}$$


Table 6.11  Adams-Bashforth formulas

    p = 0    $Y_{n+1} = Y_n + hY'_n + \tfrac{1}{2}h^2 Y''(\xi_n)$
    p = 1    $Y_{n+1} = Y_n + \tfrac{h}{2}[3Y'_n - Y'_{n-1}] + \tfrac{5}{12}h^3 Y^{(3)}(\xi_n)$
    p = 2    $Y_{n+1} = Y_n + \tfrac{h}{12}[23Y'_n - 16Y'_{n-1} + 5Y'_{n-2}] + \tfrac{3}{8}h^4 Y^{(4)}(\xi_n)$
    p = 3    $Y_{n+1} = Y_n + \tfrac{h}{24}[55Y'_n - 59Y'_{n-1} + 37Y'_{n-2} - 9Y'_{n-3}] + \tfrac{251}{720}h^5 Y^{(5)}(\xi_n)$

There is an alternative form that is very useful for estimating the truncation
error. Using the mean value theorem (Theorem 1.2) and the analogue for
backward differences of Lemma 2 of Section 3.4, we have

$$T_n(Y) = h\gamma_{p+1}\nabla^{p+1}Y'_n + O(h^{p+3}) \qquad (6.7.20)$$

The principal part of this error, $h\gamma_{p+1}\nabla^{p+1}Y'_n$, is the final term that would be
obtained by using the interpolating polynomial $P_{p+1}(t)$ rather than $P_p(t)$. This
form of the error is a basic tool used in algorithms that vary the order of the
method to help control the truncation error.
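A small numerical check may make (6.7.20) concrete. The sketch below is an illustration under stated assumptions, not part of the text: it takes $Y(x) = e^x$ (so that $Y' = Y$), p = 2, h = 0.1, and $x_n = 0$, and compares the true truncation error of the backward difference formula with the estimate $h\gamma_{p+1}\nabla^{p+1}Y'_n$. The helper name `backward_differences` is hypothetical.

```python
import math

# Coefficients gamma_0, ..., gamma_5 from Table 6.10
GAMMA = [1.0, 1 / 2, 5 / 12, 3 / 8, 251 / 720, 95 / 288]


def backward_differences(values):
    """values = [g_n, g_{n-1}, ...]; returns [g_n, nabla g_n, nabla^2 g_n, ...]."""
    diffs, row = [], list(values)
    while row:
        diffs.append(row[0])
        row = [row[i] - row[i + 1] for i in range(len(row) - 1)]
    return diffs


h, p, x_n = 0.1, 2, 0.0
# Y(x) = exp(x), hence Y'(x) = exp(x); sample Y' at x_n, x_{n-1}, ..., x_{n-p-1}
dY = [math.exp(x_n - k * h) for k in range(p + 2)]
nabla = backward_differences(dY)

# h * sum_{j=0}^{p} gamma_j nabla^j Y'_n, the integral term of (6.7.21)
ab_step = h * sum(GAMMA[j] * nabla[j] for j in range(p + 1))
# True truncation error T_n(Y) and its estimate from (6.7.20)
truncation = math.exp(x_n + h) - math.exp(x_n) - ab_step
estimate = h * GAMMA[p + 1] * nabla[p + 1]
```

For this smooth test function the estimate and the true truncation error agree to within about ten percent at h = 0.1, and the agreement improves as h decreases, as the $O(h^{p+3})$ remainder suggests.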

Using (6.7.17) and (6.7.19), equation (6.7.14) becomes

$$Y_{n+1} = Y_n + h\sum_{j=0}^{p}\gamma_j\nabla^j Y'_n + \gamma_{p+1}h^{p+2}Y^{(p+2)}(\xi_n) \qquad (6.7.21)$$

for some $x_{n-p} \le \xi_n \le x_{n+1}$. The corresponding numerical method is

$$y_{n+1} = y_n + h\sum_{j=0}^{p}\gamma_j\nabla^j y'_n \qquad n \ge p \qquad (6.7.22)$$

In the formula, $y'_j \equiv f(x_j, y_j)$, and as an example of the backward differences,
$\nabla y'_j = y'_j - y'_{j-1}$. Table 6.11 contains the formulas for p = 0, 1, 2, 3. They are
written in the more usual form of (6.7.1), in which the dependence on the value
$f(x_{n-j}, y_{n-j})$ is shown explicitly. Note that the p = 0 case is just Euler's method.
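The passage from the backward difference form (6.7.22) to the explicit weights of Table 6.11 uses only the expansion $\nabla^j g_n = \sum_{k=0}^{j}(-1)^k\binom{j}{k}g_{n-k}$. A minimal sketch follows; the function name `adams_bashforth_weights` is an illustrative choice, and the $\gamma_j$ values are copied from Table 6.10.

```python
from fractions import Fraction
from math import comb

# gamma_0, ..., gamma_5 from Table 6.10
GAMMA = [Fraction(1), Fraction(1, 2), Fraction(5, 12),
         Fraction(3, 8), Fraction(251, 720), Fraction(95, 288)]


def adams_bashforth_weights(p):
    """Return [beta_0, ..., beta_p] with y_{n+1} = y_n + h * sum_k beta_k * f_{n-k}."""
    beta = [Fraction(0)] * (p + 1)
    for j in range(p + 1):
        for k in range(j + 1):
            # coefficient of f_{n-k} inside nabla^j f_n is (-1)^k C(j, k)
            beta[k] += GAMMA[j] * (-1) ** k * comb(j, k)
    return beta
```

For p = 3 this returns 55/24, -59/24, 37/24, -9/24 (the last printed in lowest terms as -3/8), matching the bracketed weights in Table 6.11.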

The Adams-Bashforth formulas satisfy the hypotheses of Theorem 6.6, and
therefore they are convergent and stable methods. In addition, they do not have 
the instability of the type associated with the midpoint method (6.4.2) and 
Simpson’s method (6.7.13). The proof of this is given in the next section, and 
further discussion is postponed until near the end of that section, when the 
concept of relative stability is introduced. 


Case 2. Adams-Moulton Methods  Again use the integral formula (6.7.14),
but interpolate $Y'(t) = f(t, Y(t))$ at the $p + 1$ points $x_{n+1}, \ldots, x_{n-p+1}$ for
$p \ge 0$. The derivation is exactly the same as for the Adams-Bashforth methods,
and we give only the final results. Equation (6.7.14) is transformed to

$$Y_{n+1} = Y_n + h\sum_{j=0}^{p}\delta_j\nabla^j Y'_{n+1} + \delta_{p+1}h^{p+2}Y^{(p+2)}(\zeta_n) \qquad (6.7.23)$$

with $x_{n-p+1} \le \zeta_n \le x_{n+1}$. The coefficients $\delta_j$ are defined by

$$\delta_j = \frac{1}{j!}\int_0^1 (s-1)s(s+1)\cdots(s+j-2)\,ds \qquad j \ge 1 \qquad (6.7.24)$$

with $\delta_0 = 1$, and a few values are given in Table 6.12. The truncation error can be
put in the form

$$T_n(Y) = h\delta_{p+1}\nabla^{p+1}Y'_{n+1} + O(h^{p+3}) \qquad (6.7.25)$$

just as with (6.7.20).

Table 6.12  Adams-Moulton coefficients

    $\delta_0$    $\delta_1$     $\delta_2$      $\delta_3$      $\delta_4$        $\delta_5$
    1      -1/2    -1/12    -1/24    -19/720    -3/160
The numerical method associated with (6.7.23) is

$$y_{n+1} = y_n + h\sum_{j=0}^{p}\delta_j\nabla^j y'_{n+1} \qquad n \ge p - 1 \qquad (6.7.26)$$

with $y'_j \equiv f(x_j, y_j)$ as before. Table 6.13 contains the low-order formulas for
p = 0, 1, 2, 3. Note that the p = 1 case is the trapezoidal method.

Formula (6.7.26) is an implicit method, and therefore a predictor is necessary
for solving it by iteration. The basic ideas involved are exactly the same as those
in Section 6.5 for the iterative solution of the trapezoidal method. If a fixed-order
predictor-corrector algorithm is desired, and if only one iteration is to be
calculated, then an Adams-Moulton formula of order $m \ge 2$ can use a predictor
of order m or m - 1. The advantage of using an order m - 1 predictor is that
the predictor and corrector would both use derivative values at the same nodes,
namely, $x_n, x_{n-1}, \ldots, x_{n-m+2}$. For example, the second-order Adams-Moulton
formula with the first-order Adams-Bashforth formula as predictor is just the
trapezoidal method with the Euler predictor. This was discussed in Section 6.5
and shown to be adequate; both methods use the single past value of the
derivative, $f(x_n, y_n)$.

Table 6.13  Adams-Moulton formulas

    p = 0    $Y_{n+1} = Y_n + hY'_{n+1} - \tfrac{1}{2}h^2 Y''(\zeta_n)$
    p = 1    $Y_{n+1} = Y_n + \tfrac{h}{2}[Y'_{n+1} + Y'_n] - \tfrac{1}{12}h^3 Y^{(3)}(\zeta_n)$
    p = 2    $Y_{n+1} = Y_n + \tfrac{h}{12}[5Y'_{n+1} + 8Y'_n - Y'_{n-1}] - \tfrac{1}{24}h^4 Y^{(4)}(\zeta_n)$
    p = 3    $Y_{n+1} = Y_n + \tfrac{h}{24}[9Y'_{n+1} + 19Y'_n - 5Y'_{n-1} + Y'_{n-2}] - \tfrac{19}{720}h^5 Y^{(5)}(\zeta_n)$
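A less trivial example is the following fourth-order method:

$$y^{(0)}_{n+1} = y_n + \frac{h}{12}\left[23f(x_n, y_n) - 16f(x_{n-1}, y_{n-1}) + 5f(x_{n-2}, y_{n-2})\right]$$

$$y^{(1)}_{n+1} = y_n + \frac{h}{24}\left[9f(x_{n+1}, y^{(0)}_{n+1}) + 19f(x_n, y_n) - 5f(x_{n-1}, y_{n-1}) + f(x_{n-2}, y_{n-2})\right] \qquad (6.7.27)$$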


Generally only one iterate is calculated, although this will alter the form of the
truncation error in (6.7.23). Let $u_n(x)$ denote the solution of $y' = f(x, y)$
passing through $(x_n, y_n)$, and let $y_{n+1}$ denote the exact solution of the implicit
corrector equation. Then for the truncation error in using the approximation
$y^{(1)}_{n+1}$,

$$u_n(x_{n+1}) - y^{(1)}_{n+1} = [u_n(x_{n+1}) - y_{n+1}] + [y_{n+1} - y^{(1)}_{n+1}]$$

Using (6.7.23) and an expansion of the iteration error,

$$u_n(x_{n+1}) - y^{(1)}_{n+1} = \delta_4 h^5 u_n^{(5)}(x_n) + \frac{3h}{8}\,f_y(x_{n+1}, y_{n+1})\,[y_{n+1} - y^{(0)}_{n+1}] + O(h^6) \qquad (6.7.28)$$

The first two terms following the equality sign are of order $h^5$. If either (1) more
iterates are calculated, or (2) a higher order predictor is used, then the principal
part of the truncation error will be simply $\delta_4 h^5 u_n^{(5)}(x_n)$, which is a more
desirable situation from the viewpoint of estimating the error.

The Adams—Moulton formulas have a significantly smaller truncation error 
than the Adams-Bashforth formulas, for comparable order methods. For example,
the fourth-order Adams-Moulton formula has a truncation error 19/251, or about .076, times
that of the fourth-order Adams-Bashforth formula. This is the principal reason
for using the implicit formulas, although there are other considerations. Note also 
that the fourth-order Adams—Moulton formula has over twice the truncation 
error of the Simpson method (6.7.13). The reason for using the Adams—Moulton 
formula is that it has much better stability properties than the Simpson method. 
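The comparison of error constants above can be reproduced from (6.7.24). The sketch below is illustrative only: `integrate_product` and `delta` are hypothetical helper names, $\gamma_4 = 251/720$ is copied from Table 6.10, and the Simpson constant 1/90 is the magnitude of the coefficient in the standard Simpson rule error term $-h^5 Y^{(5)}/90$, which is assumed here to be the form used in (6.7.13).

```python
from fractions import Fraction
from math import factorial


def integrate_product(shifts):
    """Exact integral over [0, 1] of the product of (s + c) for c in shifts."""
    coeffs = [Fraction(1)]                  # coeffs[k] multiplies s**k
    for c in shifts:
        nxt = [Fraction(0)] * (len(coeffs) + 1)
        for k, a in enumerate(coeffs):
            nxt[k] += a * c
            nxt[k + 1] += a
        coeffs = nxt
    return sum(a / (k + 1) for k, a in enumerate(coeffs))


def delta(j):
    """delta_j from (6.7.24): integrand (s-1)s(s+1)...(s+j-2), with delta_0 = 1."""
    return integrate_product(range(-1, j - 1)) / factorial(j)


GAMMA_4 = Fraction(251, 720)                # Adams-Bashforth, from Table 6.10
SIMPSON = Fraction(1, 90)                   # assumed magnitude of Simpson's error constant

am_vs_ab = abs(delta(4)) / GAMMA_4          # Adams-Moulton relative to Adams-Bashforth
am_vs_simpson = abs(delta(4)) / SIMPSON     # Adams-Moulton relative to Simpson
```

Here `am_vs_ab` is 19/251 and `am_vs_simpson` is 19/8, consistent with the factor .076 and with "over twice" in the paragraph above.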


Example  Method (6.7.27) was used to solve

$$y' = \frac{1}{1 + x^2} - 2y^2 \qquad y(0) = 0 \qquad (6.7.29)$$

which has the solution $Y(x) = x/(1 + x^2)$. The initial values $y_1, y_2, y_3$ were
taken to be the true values to simplify the example. The solution values were
computed with two values of h, and the resulting errors at a few node points are
given in Table 6.14. The column labeled Ratio is the ratio of the error with
h = .125 to that with h = .0625. Note that the values of Ratio are near 16, which
is the theoretical ratio that would be expected since method (6.7.27) is fourth
order.

Table 6.14  Numerical example of the Adams method

            Error for     Error for
    x       h = .125      h = .0625     Ratio
    2.0     2.07E-5       1.21E-6       17.1
    4.0     2.21E-6       1.20E-7       18.3
    6.0     3.74E-7       2.00E-8       18.7
    8.0     1.00E-7       5.24E-9       19.1
    10.0    3.58E-8       1.83E-9       19.6
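The experiment can be imitated with a short program. The following is a sketch under assumptions rather than the code used for Table 6.14: it applies the predictor and one corrector pass of (6.7.27), re-evaluates f at the corrected value, and takes $y_0, y_1, y_2$ from the true solution. The function names are illustrative, and the errors it produces need not match the table digit for digit.

```python
def f(x, y):
    """Right-hand side of (6.7.29)."""
    return 1.0 / (1.0 + x * x) - 2.0 * y * y


def exact(x):
    """True solution Y(x) = x / (1 + x^2)."""
    return x / (1.0 + x * x)


def adams_pc(h, x_end):
    """Third-order Adams-Bashforth predictor with one Adams-Moulton correction."""
    n_steps = round(x_end / h)
    xs = [i * h for i in range(n_steps + 1)]
    ys = [exact(x) for x in xs[:3]]              # starting values from the true solution
    fs = [f(x, y) for x, y in zip(xs[:3], ys)]
    for n in range(2, n_steps):
        pred = ys[n] + h / 12.0 * (23 * fs[n] - 16 * fs[n - 1] + 5 * fs[n - 2])
        corr = ys[n] + h / 24.0 * (9 * f(xs[n + 1], pred) + 19 * fs[n]
                                   - 5 * fs[n - 1] + fs[n - 2])
        ys.append(corr)
        fs.append(f(xs[n + 1], corr))            # derivative at the accepted value
    return xs, ys


def error_ratio(x_end, h=0.125):
    """Ratio of the errors at x_end for stepsizes h and h/2."""
    _, coarse = adams_pc(h, x_end)
    _, fine = adams_pc(h / 2.0, x_end)
    return abs(exact(x_end) - coarse[-1]) / abs(exact(x_end) - fine[-1])
```

A call such as `error_ratio(2.0)` should return a value of roughly 16, the ratio expected when halving h in a fourth-order method.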


Variable-order methods At present, the most popular predictor—corrector al- 
gorithms control the truncation error by varying both the stepsize and the order 
of the method, and all of these algorithms use the Adams family of formulas. The 
first such computer programs that were widely used were DIFSUB from Gear 
(1971, pp. 158-166) and codes due to Krogh (1969). Subsequently, other such 
variable-order Adams codes were written, with GEAR from Hindmarsh (1974)
and DE/STEP from Shampine and Gordon (1975) among the most popular. The 
code GEAR has been further improved, to the code LSODE, and is a part of a 
larger program package called ODEPACK [see Hindmarsh (1983) for a descrip- 
tion]. The code DE/STEP has been further improved, to DDEABM, and it is 
part of another general package, called DEPAC [see Shampine and Watts (1980) 
for a description]. In all cases, the previously cited codes use error estimation that 
is based on the formulas (6.7.23) or (6.7.25), although they [and formulas (6.7.22) 
and (6.7.26)] may need to be modified for a variable stepsize. The codes vary in a 
number of important but technical aspects, including the amount of iteration of 
the Adams-Moulton corrector, the form in which past information about the derivative
f is carried forward, and the form of interpolation of the solution to the
current output node point. Because of lack of space, and because of the 
complexity of the issues involved, we omit any further discussion of the compara- 
tive differences in these codes. For some further remarks on these codes, see 
Gupta et al. (1985, pp. 16-19). 

By allowing the order to vary, there is no difficulty in obtaining starting values 
for the higher order Adams methods. The programs begin with the second-order 
trapezoidal formula with an Euler predictor; the order is then generally increased 
as extra starting values become available. If the solution is changing rapidly, then 
the program will generally choose a low-order formula, while for a smoother and 
more slowly varying solution, the order will usually be larger. 
In the program DE of Shampine and Gordon (1975, pp. 186-209), the truncation
error at $x_{n+1}$ [call it trunc] is required to satisfy

$$|\mathrm{trunc}_j| \le \mathrm{ABSERR} + \mathrm{RELERR} \cdot |y_{n,j}| \qquad (6.7.30)$$

This is to hold for each component of the truncation error and for each
corresponding component $y_{n,j}$ of the solution $y_n$ of the given system of differential
equations. The values ABSERR and RELERR are supplied by the user. The
value of trunc is given, roughly speaking, by

$$\mathrm{trunc} \doteq \delta_{p+1}h\nabla^{p+1}y'_{n+1}$$

assuming the spacing is uniform. This is the truncation error for the p-step
formula

$$y^{(p)}_{n+1} = y_n + h\sum_{j=0}^{p}\delta_j\nabla^j y'_{n+1}$$

Once the test (6.7.30) is satisfied, the value of $y_{n+1}$ is

$$y_{n+1} \equiv y^{(p+1)}_{n+1} = y^{(p)}_{n+1} + \mathrm{trunc} \qquad (6.7.31)$$

Thus the actual truncation error is $O(h^{p+3})$, and combined with (6.7.30), it can
be shown that the truncation error in $y_{n+1}$ satisfies an error per unit step criterion,
which is similar to that of (6.6.1) for the algorithm Detrap of Section 6.6. For a
detailed discussion, see Shampine and Gordon (1975, p. 100).

The program DE (and its successor DDEABM) is very sophisticated in its 
error control, including the choosing of the order and the stepsize. It cannot be 
discussed adequately in the limited space available in this text, but the best 
reference is the text of Shampine and Gordon (1975), which is devoted to 
variable-order Adams algorithms. The programs DE and DDEABM have been 
well designed from both the viewpoint of error control and user convenience.
Each is also written in a portable form, and generally, is a well-recommended 
program for solving differential equations. 


Example  Consider the problem

$$y' = \frac{y}{4}\left(1 - \frac{y}{20}\right) \qquad y(0) = 1 \qquad (6.7.32)$$

which has the solution

$$Y(x) = \frac{20}{1 + 19e^{-x/4}}$$

DDEABM was used to solve this problem with values output at x = 2, 4, 6, ..., 20.
Three values of ABSERR were used, and RELERR = 0 in all cases. The true
global errors are shown in Table 6.15. The column labeled NFE gives the number
of evaluations of f(x, y) necessary to obtain the value $y_h(x)$, beginning from $x_0$.
Table 6.15  Example of the automatic program DDEABM

            ABSERR = 10^-3      ABSERR = 10^-6      ABSERR = 10^-9
    x       Error       NFE     Error       NFE     Error        NFE
    4.0     -3.26E-5    15      1.24E-6     28       2.86E-10    52
    8.0      6.00E-4    21      3.86E-6     42      -1.98E-9     16
    12.0     1.70E-3    25      4.93E-6     54      -2.41E-9     102
    16.0     9.13E-4    31      3.73E-6     64      -1.86E-9     124
    20.0     9.16E-4    37      1.79E-6     14      -9.58E-10    138


Global error  The automatic computer codes discussed previously
control the local error or truncation error. They do not control the global error in 
the solution. Usually the truncation error is kept so small by these codes that the 
global error is also within acceptable limits, although that is not guaranteed. The 
reasons for this small global error are much the same as those described in 
Section 6.6; in particular, recall (6.6.15) and (6.6.16). 

The global error can be monitored, and we give an example of this below. But 
even with an estimate of the global error, we cannot control it for most equations. 
This is because the global error is composed of the effects of all past truncation 
errors, and decreasing the present stepsize will not change those past errors. In 
general, if the global error is too large, then the equation must be solved again, 
with a smaller stepsize. 

There are a number of methods that have been proposed for monitoring the 
global error. One of these, Richardson extrapolation, is illustrated in Section 6.10 
for a Runge-Kutta method. Below, we illustrate another one for the method of 
Section 6.6. For a general survey of the topic, see Skeel (1986). 

For the trapezoidal method, the true solution Y(x) satisfies

$$Y(x_{n+1}) = Y(x_n) + \frac{h}{2}\left[f(x_n, Y(x_n)) + f(x_{n+1}, Y(x_{n+1}))\right] - \frac{h^3}{12}Y^{(3)}(\xi_n)$$

with $h = x_{n+1} - x_n$ and $x_n \le \xi_n \le x_{n+1}$. Subtracting the trapezoidal rule

$$y_{n+1} = y_n + \frac{h}{2}\left[f(x_n, y_n) + f(x_{n+1}, y_{n+1})\right]$$

we have

$$e_{n+1} = e_n + \frac{h}{2}\left\{[f(x_n, y_n + e_n) - f(x_n, y_n)] + [f(x_{n+1}, y_{n+1} + e_{n+1}) - f(x_{n+1}, y_{n+1})]\right\} - \frac{h^3}{12}Y^{(3)}(\xi_n) \qquad (6.7.33)$$

with $e_n = Y(x_n) - y_n$, $n \ge 0$. This is the error equation for the trapezoidal
method, and we try to solve it approximately in order to calculate $e_{n+1}$.




Table 6.16  Global error calculation for Detrap

    $x_n$         h         $e_n$          $\hat e_n$         trunc
      .0227     .0227      5.84E-6      5.83E-6      5.84E-6
      .0454     .0227      1.17E-5      1.16E-5      5.83E-6
      .0681     .0227      1.74E-5      1.73E-5      5.16E-6
      .0908     .0227      2.28E-5      2.28E-5      5.62E-6
      .2725     .0227      5.16E-5      5.19E-5      2.96E-6
      .3065     .0340      5.66E-5      5.66E-5      6.74E-6
      .3405     .0340      6.01E-5      6.05E-5      6.21E-6
      .3746     .0340      6.11E-5      6.21E-5      4.28E-6
      .4408     .0662      5.05E-5      5.04E-5     -6.56E-6
      .5070     .0662      2.44E-5      3.57E-5     -1.04E-5
      .5732     .0662     -2.03E-5      4.34E-6     -2.92E-5
      .6138     .0406     -2.99E-5     -6.76E-6     -1J2E-5
     1.9595     .1350     -1.02E-4     -8.76E-5     -1.64E-5
     2.0942     .135      -1.03E-4     -8.68E-5     -1.79E-5
     2.3172     .223      -6.57E-5     -5.08E-5      1.27E-5
     2.7632     .446       3.44E-4      3.17E-4      4.41E-4
     3.0664     .303       3.21E-4      2.96E-4      9.39E-5
     7.6959     .672       3.87E-4      2.69E-4      8.77E-5
     8.6959    1.000       3.96E-4      3.05E-4      1.73E-4
     9.6959    1.000       4.27E-4      2.94E-4      1.18E-4
    10.6959    1.000       4.11E-4      2.71E-4      9.45E-5


Returning to the algorithm Detrap of Section 6.6, we replace the truncation
term in (6.7.33) with the variable trunc computed in Detrap. Then we solve
(6.7.33) for $\hat e_{n+1}$, which will be an approximation of the true global error $e_{n+1}$.
We can solve for $\hat e_{n+1}$ by using various rootfinding methods, but we use simple
fixed-point iterations:

$$\hat e^{(j+1)}_{n+1} = \hat e_n + \frac{h}{2}\left\{[f(x_n, y_n + \hat e_n) - f(x_n, y_n)] + [f(x_{n+1}, y_{n+1} + \hat e^{(j)}_{n+1}) - f(x_{n+1}, y_{n+1})]\right\} + \mathrm{trunc} \qquad (6.7.34)$$

for $j \ge 0$. We use $\hat e^{(0)}_{n+1} = \hat e_n$, and since this is for just illustrative purposes, we
iterate several times in (6.7.34). This simple idea is closely related to the
difference correction methods of Skeel (1986).
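The iteration (6.7.34) is easy to try on a problem where everything is known. The sketch below is an illustration under strong simplifying assumptions and is not Detrap: it uses the test problem $y' = -y$, y(0) = 1, a fixed stepsize, a closed-form trapezoidal step (possible because f is linear in y), and a stand-in for trunc computed from the known third derivative at the midpoint of each step. All names are hypothetical.

```python
import math


def f(x, y):
    return -y


def exact(x):
    return math.exp(-x)


def trapezoid_with_error_estimate(h, n_steps, n_iter=5):
    """Trapezoidal method for y' = -y with a running global error estimate e_hat."""
    x, y, e_hat = 0.0, 1.0, 0.0
    for _ in range(n_steps):
        # f is linear in y, so the implicit trapezoidal equation solves in closed form
        y_new = y * (1.0 - h / 2.0) / (1.0 + h / 2.0)
        x_new = x + h
        # stand-in for Detrap's trunc: -(h^3/12) Y'''(xi_n) with Y''' = -exp(-x)
        trunc = (h ** 3 / 12.0) * math.exp(-(x + h / 2.0))
        e_next = e_hat                           # starting guess for the iteration
        for _ in range(n_iter):                  # fixed-point iteration (6.7.34)
            e_next = (e_hat
                      + (h / 2.0) * ((f(x, y + e_hat) - f(x, y))
                                     + (f(x_new, y_new + e_next) - f(x_new, y_new)))
                      + trunc)
        x, y, e_hat = x_new, y_new, e_next
    return x, y, e_hat
```

With h = 0.1 and ten steps, the returned `e_hat` agrees closely with the true global error `exact(x) - y`. The agreement is much better than in Table 6.16 because the truncation term here comes from the exact solution rather than from an estimate.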


Example  We repeat the calculation given in Table 6.9, for Detrap applied to
Eq. (6.6.9). We use the same parameters for Detrap. The results are shown in
Table 6.16, for the same values of $x_n$ as in Table 6.9. The results show that $e_n$
and $\hat e_n$ are almost always reasonably close, certainly in magnitude. The
approximation $e_n \doteq \hat e_n$ is poor around x = .5, due to the poor estimate of the
truncation error in Detrap. Even then these poor results damp out for this
problem, and for larger values of $x_n$, the approximation $e_n \doteq \hat e_n$ is still useful.




6.8 Convergence and Stability Theory 
for Multistep Methods 


In this section, a complete theory of convergence and stability is presented for
the multistep method

$$y_{n+1} = \sum_{j=0}^{p} a_j y_{n-j} + h\sum_{j=-1}^{p} b_j f(x_{n-j}, y_{n-j}) \qquad x_{p+1} \le x_{n+1} \le b \qquad (6.8.1)$$

This generalizes the work of Section 6.3, and it creates the mathematical tools
necessary for analyzing whether method (6.8.1) is only weakly stable, due to
instability of the type associated with the midpoint method.

We begin with a few definitions. The concept of stability was introduced with
Euler's method [see (6.2.28) and (6.2.20)], and it is now generalized. Let
$\{y_n \mid 0 \le n \le N(h)\}$ be the solution of (6.8.1) for some differential equation $y' = f(x, y)$,
for all sufficiently small values of h, say $h \le h_0$. Recall that N(h) denotes the
largest subscript N for which $x_N \le b$. For each $h \le h_0$, perturb the initial values
$y_0, \ldots, y_p$ to new values $z_0, \ldots, z_p$ with

$$\max_{0 \le n \le p}|y_n - z_n| \le \epsilon \qquad 0 < h \le h_0 \qquad (6.8.2)$$

Note that these initial values are likely to depend on h. We say the family of
solutions $\{y_n \mid 0 \le n \le N(h)\}$ is stable if there is a constant c, independent of
$h \le h_0$ and valid for all sufficiently small $\epsilon$, for which

$$\max_{0 \le n \le N(h)}|y_n - z_n| \le c\epsilon \qquad 0 < h \le h_0 \qquad (6.8.3)$$

Consider all differential equation problems

$$y' = f(x, y) \qquad y(x_0) = Y_0 \qquad (6.8.4)$$

with the derivative f(x, y) continuous and satisfying the Lipschitz condition
(6.2.12), and suppose the approximating solutions $\{y_n\}$ are all stable. Then we
say that (6.8.1) is a stable numerical method.

To define convergence for a given problem (6.8.4), suppose the initial values
$y_0, \ldots, y_p$ satisfy

$$\eta(h) \equiv \max_{0 \le n \le p}|Y(x_n) - y_n| \to 0 \quad \text{as} \quad h \to 0 \qquad (6.8.5)$$

Then the solution $\{y_n\}$ is said to converge to Y(x) if

$$\max_{x_0 \le x_n \le b}|Y(x_n) - y_n| \to 0 \quad \text{as} \quad h \to 0 \qquad (6.8.6)$$

If (6.8.1) is convergent for all problems (6.8.4), then it is called a convergent
numerical method.
Recall the definition of consistency given in Section 6.3. Method (6.8.1) is
consistent if

$$\frac{1}{h}\max_{x_p \le x_n \le b}|T_n(Y)| \to 0 \quad \text{as} \quad h \to 0$$

for all functions Y(x) continuously differentiable on $[x_0, b]$. Or equivalently from
Theorem 6.5, the coefficients $\{a_j\}$ and $\{b_j\}$ must satisfy

$$\sum_{j=0}^{p} a_j = 1 \qquad -\sum_{j=0}^{p} j a_j + \sum_{j=-1}^{p} b_j = 1 \qquad (6.8.7)$$

Convergence can be shown to imply consistency; consequently, we consider only
methods satisfying (6.8.7). As an example of the proof of the necessity of (6.8.7),
the assumption of convergence of (6.8.1) for the problem

$$y' \equiv 0 \qquad y(0) = 1$$

will imply the first condition in (6.8.7). Just take $y_0 = \cdots = y_p = 1$, and observe
the consequences of the convergence of $y_{p+1}$ to $Y(x) \equiv 1$.

The convergence and stability of (6.8.1) are linked to the roots of the
polynomial

$$\rho(r) = r^{p+1} - \sum_{j=0}^{p} a_j r^{p-j} \qquad (6.8.8)$$

Note that $\rho(1) = 0$ from the consistency condition (6.8.7). Let $r_0, \ldots, r_p$ denote
the roots of $\rho(r)$, repeated according to their multiplicity, and let $r_0 = 1$. The
method (6.8.1) satisfies the root condition if

1. $|r_j| \le 1 \qquad j = 0, 1, \ldots, p$    (6.8.9)
2. $|r_j| = 1 \Rightarrow \rho'(r_j) \ne 0$    (6.8.10)

The first condition requires all roots of $\rho(r)$ to lie in the unit circle $\{z : |z| \le 1\}$
in the complex plane. Condition (6.8.10) states that all roots on the boundary of
the circle are to be simple roots of $\rho(r)$.
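For a two-step method (p = 1) the root condition can be checked directly from the quadratic formula. The following is a minimal sketch restricted to that case; the function names and the tolerance `tol` are illustrative assumptions, and for larger p a general polynomial root finder would be needed instead.

```python
import cmath


def rho_roots(a0, a1):
    """Roots of rho(r) = r^2 - a0*r - a1 for a two-step method (p = 1)."""
    disc = cmath.sqrt(a0 * a0 + 4.0 * a1)
    return (a0 + disc) / 2.0, (a0 - disc) / 2.0


def root_condition(a0, a1, tol=1e-12):
    """True if (6.8.9) and (6.8.10) both hold for rho(r) = r^2 - a0*r - a1."""
    r0, r1 = rho_roots(a0, a1)
    if max(abs(r0), abs(r1)) > 1.0 + tol:
        return False                      # (6.8.9) fails: a root lies outside the unit circle
    for r in (r0, r1):
        # rho'(r) = 2r - a0 vanishes exactly at a double root
        if abs(abs(r) - 1.0) <= tol and abs(2.0 * r - a0) <= tol:
            return False                  # (6.8.10) fails: multiple root on the boundary
    return True
```

For instance, the midpoint method has $a_0 = 0$, $a_1 = 1$, with simple roots at 1 and -1, so `root_condition(0, 1)` is true, while $a_0 = 2$, $a_1 = -1$ gives a double root at r = 1 and fails the second condition.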

The main results of this section are pictured in Figure 6.6, although some of
them will not be proved. The strong root condition and the concept of relative
stability are introduced later in the section.

    Strong root condition  ==>  Relative stability

    Convergence  <==>  Root condition  <==>  Stability

Figure 6.6  Schematic of the theory for consistent multistep methods.
Stability theory  All of the numerical methods presented in the preceding
sections have been stable, but we now give an example of an unstable method.
This is to motivate the need to develop a general theory of stability.

Example  Recall the general formula (6.7.10) for an explicit two-step second-order
method, and choose $a_0 = 3$. Then we obtain the method

$$y_{n+1} = 3y_n - 2y_{n-1} + \frac{h}{2}\left[f(x_n, y_n) - 3f(x_{n-1}, y_{n-1})\right] \qquad n \ge 1 \qquad (6.8.11)$$

with the truncation error

$$T_n(Y) = \tfrac{7}{12}h^3 Y^{(3)}(\xi_n) \qquad x_{n-1} \le \xi_n \le x_{n+1}$$

Consider solving the problem $y' \equiv 0$, y(0) = 0, which has the solution $Y(x) \equiv 0$.
Using $y_0 = y_1 = 0$, the numerical solution is clearly $y_n = 0$, $n \ge 0$. Perturb the
initial data to $z_0 = \epsilon/2$, $z_1 = \epsilon$, for some $\epsilon \ne 0$. Then the corresponding numerical
solution can be shown to be

$$z_n = \epsilon \cdot 2^{n-1} \qquad n \ge 0 \qquad (6.8.12)$$

The reasoning used in deriving this solution is given later in a more general
context. To see the effect of the perturbation on the original solution,

$$\max_{x_0 \le x_n \le b}|y_n - z_n| = \max_{0 \le x_n \le b}|\epsilon|\,2^{n-1} = |\epsilon|\,2^{N(h)-1}$$

Since $N(h) \to \infty$ as $h \to 0$, the deviation of $\{z_n\}$ from $\{y_n\}$ becomes increasingly
greater as $h \to 0$. The method (6.8.11) is unstable, and it should never be
used. Also note that the root condition is violated, since $\rho(r) = r^2 - 3r + 2$ has
the roots $r_0 = 1$, $r_1 = 2$.
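The growth in (6.8.12) can be observed directly. In the sketch below (an illustration, with a hypothetical function name and an arbitrarily chosen perturbation size), method (6.8.11) is applied to $y' \equiv 0$, so the derivative terms vanish and only the recurrence $z_{n+1} = 3z_n - 2z_{n-1}$ remains. Exact rational arithmetic is used so that the comparison with $\epsilon \cdot 2^{n-1}$ is free of rounding.

```python
from fractions import Fraction


def unstable_two_step(z0, z1, n_max):
    """Method (6.8.11) applied to y' = 0: z_{n+1} = 3 z_n - 2 z_{n-1}."""
    z = [z0, z1]
    for n in range(1, n_max):
        z.append(3 * z[n] - 2 * z[n - 1])
    return z


eps = Fraction(1, 10**6)                     # assumed size of the perturbation
z = unstable_two_step(eps / 2, eps, 20)      # z[0], ..., z[20]
```

After only 20 steps a perturbation of size $10^{-6}$ has grown to `z[20]`, which equals $2^{19} \cdot 10^{-6}$, or about 0.52. Halving h doubles the number of steps needed to reach a fixed b and therefore squares the growth factor.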


To investigate the stability of (6.8.1), we consider only the special equation

$$y' = \lambda y \qquad y(0) = 1 \qquad (6.8.13)$$

with the solution $Y(x) = e^{\lambda x}$. The results obtained will transfer to the study of
stability for a general differential equation problem. An intuitive reason for this is
easily derived. Expand $Y'(x) = f(x, Y(x))$ about $(x_0, Y_0)$ to obtain

$$Y'(x) \doteq f(x_0, Y_0) + f_x(x_0, Y_0)(x - x_0) + f_y(x_0, Y_0)(Y(x) - Y_0)$$

$$= \lambda(Y(x) - Y_0) + g(x) \qquad (6.8.14)$$

with $\lambda = f_y(x_0, Y_0)$ and $g(x) = f(x_0, Y_0) + f_x(x_0, Y_0)(x - x_0)$. This is a valid
approximation if $x - x_0$ is sufficiently small. Introducing $V(x) = Y(x) - Y_0$,

$$V'(x) \doteq \lambda V(x) + g(x) \qquad (6.8.15)$$

The inhomogeneous term g(x) will drop out of all derivations concerning
numerical stability, because we are concerned with differences of solutions of the
equation. Dropping g(x) in (6.8.15), we obtain the model equation (6.8.13). As
further motivation, refer back to the stability results (6.1.5)-(6.1.10), and to the
trapezoidal stability results in (6.5.20)-(6.5.23).

In the case that $y' = f(x, y)$ represents a system of m differential equations,
as in (6.1.13), the partial derivative $f_y(x, y)$ becomes a Jacobian matrix,

$$[f_y(x, y)]_{ij} = \frac{\partial f_i}{\partial y_j} \qquad 1 \le i, j \le m$$

as in (6.2.54). Thus the model equation becomes

$$y' = \Lambda y + g(x) \qquad (6.8.16)$$

a system of m linear differential equations with $\Lambda = f_y(x_0, Y_0)$. It can be shown
that in most cases, this system reduces to an equivalent system

$$z_i' = \lambda_i z_i + \gamma_i(x) \qquad 1 \le i \le m \qquad (6.8.17)$$

with $\lambda_1, \ldots, \lambda_m$ the eigenvalues of $\Lambda$ (see Problem 24). With (6.8.17), we are back
to the simple model equation (6.8.13), provided we allow $\lambda$ to be complex in
order to include all possible eigenvalues of $\Lambda$.

Applying (6.8.1) to the model equation (6.8.13), we obtain

$$y_{n+1} = \sum_{j=0}^{p} a_j y_{n-j} + h\lambda\sum_{j=-1}^{p} b_j y_{n-j} \qquad (6.8.18)$$

$$(1 - h\lambda b_{-1})y_{n+1} - \sum_{j=0}^{p}(a_j + h\lambda b_j)y_{n-j} = 0 \qquad n \ge p \qquad (6.8.19)$$

This is a homogeneous linear difference equation of order p + 1, and the theory
for its solvability is completely analogous to that of (p + 1)st-order homogeneous
linear differential equations. As a general reference, see Henrici (1962, pp.
210-215) or Isaacson and Keller (1966, pp. 405-417).

We attempt to find a general solution by first looking for solutions of the
special form

$$y_n = r^n \qquad n \ge 0$$

If we can find p + 1 linearly independent solutions, then an arbitrary linear
combination will give the general solution of (6.8.19).
Substituting $y_n = r^n$ into (6.8.19) and canceling $r^{n-p}$, we obtain

$$(1 - h\lambda b_{-1})r^{p+1} - \sum_{j=0}^{p}(a_j + h\lambda b_j)r^{p-j} = 0 \qquad (6.8.20)$$

This is called the characteristic equation, and the left-hand side is the characteristic
polynomial. The roots are called characteristic roots. Define

$$\sigma(r) = b_{-1}r^{p+1} + \sum_{j=0}^{p} b_j r^{p-j}$$

and recall the definition (6.8.8) of $\rho(r)$. Then (6.8.20) becomes

$$\rho(r) - h\lambda\sigma(r) = 0 \qquad (6.8.21)$$

Denote the characteristic roots by

$$r_0(h\lambda), \ldots, r_p(h\lambda)$$

which can be shown to depend continuously on the value of $h\lambda$. When $h\lambda = 0$,
the equation (6.8.21) becomes simply $\rho(r) = 0$, and we have $r_j(0) = r_j$,
j = 0, 1, ..., p, for the earlier roots $r_j$ of $\rho(r) = 0$. Since $r_0 = 1$ is a root of $\rho(r)$, we
let $r_0(h\lambda)$ be the root of (6.8.21) for which $r_0(0) = 1$. The root $r_0(h\lambda)$ is called
the principal root, for reasons that will become apparent later. If the roots $r_j(h\lambda)$
are all distinct, then the general solution of (6.8.19) is

$$y_n = \sum_{j=0}^{p}\gamma_j[r_j(h\lambda)]^n \qquad n \ge 0 \qquad (6.8.22)$$

But if $r_j(h\lambda)$ is a root of multiplicity $\nu > 1$, then the following are $\nu$ linearly
independent solutions of (6.8.19):

$$\{[r_j(h\lambda)]^n\},\ \{n[r_j(h\lambda)]^n\},\ \ldots,\ \{n^{\nu-1}[r_j(h\lambda)]^n\}$$

These can be used with the solution arising from the other roots to generate a
general solution for (6.8.19), comparable to (6.8.22).


Theorem 6.7  Assume the consistency condition (6.8.7). Then the multistep
method (6.8.1) is stable if and only if the root condition (6.8.9),
(6.8.10) is satisfied.

Proof  1. We begin by showing the necessity of the root condition for stability.
To do so, assume the opposite by letting

$$|r_j(0)| > 1$$

for some j. Consider the differential equation problem $y' \equiv 0$, y(0) = 0,
with the solution $Y(x) \equiv 0$. Then (6.8.1) becomes

$$y_{n+1} = \sum_{j=0}^{p} a_j y_{n-j} \qquad n \ge p \qquad (6.8.23)$$

If we take $y_0 = y_1 = \cdots = y_p = 0$, then the numerical solution is clearly
$y_n = 0$, with all $n \ge 0$. For the perturbed initial values, take

$$z_0 = \epsilon,\ z_1 = \epsilon r_j(0),\ \ldots,\ z_p = \epsilon r_j(0)^p \qquad (6.8.24)$$

For these initial values,

$$\max_{0 \le n \le p}|y_n - z_n| = \epsilon|r_j(0)|^p$$

which is a uniform bound for all small values of h, since the right side is
independent of h. As $\epsilon \to 0$, the bound also tends to zero.
The solution of (6.8.23) with the initial conditions (6.8.24) is simply

$$z_n = \epsilon[r_j(0)]^n \qquad n \ge 0$$

For the deviation from $\{y_n\}$,

$$\max_{x_0 \le x_n \le b}|y_n - z_n| = \epsilon|r_j(0)|^{N(h)}$$

As $h \to 0$, $N(h) \to \infty$ and the bound becomes infinite. This proves the
method is unstable when some $|r_j(0)| > 1$. If the root condition is
violated instead by assuming (6.8.10) is false, then a similar proof can be
given. This is left as Problem 29.


2. Assume the root condition is satisfied. The proof of stability will be
restricted to the model equation (6.8.13). A proof can be given for the
general equation $y' = f(x, y)$, but it is a fairly involved modification of
the following proof. The general proof involves the solution of nonhomogeneous
linear difference equations [see Isaacson and Keller (1966), pp.
405-417, for a complete development]. To further simplify the proof, we
assume that the roots $r_j(0)$, j = 0, 1, ..., p, are all distinct. The same will
be true of $r_j(h\lambda)$, provided the value of h is kept sufficiently small, say
$0 \le h \le h_0$.

Let $\{y_n\}$ and $\{z_n\}$ be two solutions of (6.8.19) on $[x_0, b]$, and assume

$$\max_{0 \le n \le p}|y_n - z_n| \le \epsilon \qquad 0 < h \le h_0 \qquad (6.8.25)$$

Introduce the error $e_n = y_n - z_n$. Subtracting using (6.8.19) for each
solution,

$$(1 - h\lambda b_{-1})e_{n+1} - \sum_{j=0}^{p}(a_j + h\lambda b_j)e_{n-j} = 0 \qquad (6.8.26)$$

for $x_{p+1} \le x_{n+1} \le b$. The general solution is

$$e_n = \sum_{j=0}^{p}\gamma_j[r_j(h\lambda)]^n \qquad n \ge 0 \qquad (6.8.27)$$
j=0 
The coefficients $\gamma_0(h), \ldots, \gamma_p(h)$ must be chosen so that

$$\gamma_0 + \gamma_1 + \cdots + \gamma_p = e_0$$

$$\gamma_0 r_0(h\lambda) + \cdots + \gamma_p r_p(h\lambda) = e_1$$

$$\vdots$$

$$\gamma_0[r_0(h\lambda)]^p + \cdots + \gamma_p[r_p(h\lambda)]^p = e_p$$

The solution (6.8.27) will then agree with the given initial perturbations
$e_0, \ldots, e_p$, and it will satisfy the difference equation (6.8.26). Using the
bound (6.8.25) and the theory of systems of linear equations, it is fairly
straightforward to show that

$$\max_{0 \le i \le p}|\gamma_i| \le c_1\epsilon \qquad 0 < h \le h_0 \qquad (6.8.28)$$

for some constant $c_1 > 0$. We omit the proof of this, although it can be
carried out easily by using concepts introduced in Chapters 7 and 8.

To bound the solution $e_n$ on $[x_0, b]$, we must bound each term
$[r_j(h\lambda)]^n$. To do so, consider the expansion

$$r_j(u) = r_j(0) + u\,r_j'(\xi) \qquad (6.8.29)$$

for some $\xi$ between 0 and u. To compute $r_j'(u)$, differentiate the
characteristic equation

$$\rho(r_j(u)) - u\,\sigma(r_j(u)) = 0$$

Then

$$r_j'(u) = \frac{\sigma(r_j(u))}{\rho'(r_j(u)) - u\,\sigma'(r_j(u))} \qquad (6.8.30)$$

By the assumption that $r_j(0)$ is a simple root of $\rho(r) = 0$, $0 \le j \le p$, it
follows that $\rho'(r_j(0)) \ne 0$, and by continuity, $\rho'(r_j(u)) \ne 0$ for all
sufficiently small values of u. The denominator in (6.8.30) is nonzero, and we
can bound $r_j'(u)$,

$$|r_j'(u)| \le c_2 \qquad \text{all } |u| \le u_0$$

for some $u_0 > 0$.
Using this with (6.8.29) and the root condition (6.8.9), we have

$$|r_j(h\lambda)| \le |r_j(0)| + c_2|h\lambda| \le 1 + c_2|h\lambda|$$

$$\left|[r_j(h\lambda)]^n\right| \le [1 + c_2|h\lambda|]^n \le e^{c_2 n|h\lambda|} \le e^{c_2(b - x_0)|\lambda|} \qquad (6.8.31)$$

for all $0 < h \le h_0$. Combined with (6.8.27) and (6.8.28),

$$\max_{x_0 \le x_n \le b}|e_n| \le c_3|\epsilon|\,e^{c_2(b - x_0)|\lambda|} \qquad 0 < h \le h_0$$

for an appropriate constant $c_3$. ∎


Convergence theory  The following result generalizes Theorem 6.6 of Section
6.3, with necessary and sufficient conditions being given for the convergence of
multistep methods.

Theorem 6.8  Assume the consistency condition (6.8.7). Then the multistep
method (6.8.1) is convergent if and only if the root condition (6.8.9),
(6.8.10) is satisfied.

Proof  1. We begin by showing the necessity of the root condition for convergence,
and again we use the problem $y' \equiv 0$, y(0) = 0, with the solution
$Y(x) \equiv 0$. The multistep method (6.8.1) becomes

$$y_{n+1} = \sum_{j=0}^{p} a_j y_{n-j} \qquad n \ge p \qquad (6.8.32)$$

with $y_0, \ldots, y_p$ chosen to satisfy

$$\eta(h) \equiv \max_{0 \le n \le p}|y_n| \to 0 \quad \text{as} \quad h \to 0 \qquad (6.8.33)$$

Suppose that the root condition is violated. We show that (6.8.32) is not
convergent to $Y(x) \equiv 0$.

Assume that some $|r_j(0)| > 1$. Then a satisfactory solution of (6.8.32)
is

$$y_n = h[r_j(0)]^n \qquad x_0 \le x_n \le b \qquad (6.8.34)$$

Condition (6.8.33) is satisfied since

$$\eta(h) = h|r_j(0)|^p \to 0 \quad \text{as} \quad h \to 0$$

But the solution $\{y_n\}$ does not converge. First,

$$\max_{x_0 \le x_n \le b}|Y(x_n) - y_n| = h|r_j(0)|^{N(h)}$$

Consider those values of h = b/N(h). Then L'Hospital's rule can be
used to show that

$$\lim_{N \to \infty}\frac{b}{N}|r_j(0)|^N = \infty$$

showing (6.8.32) does not converge to the solution $Y(x) \equiv 0$.

Assume (6.8.9) of the root condition is satisfied, but that some $r_j(0)$ is
a multiple root of $\rho(r)$ and $|r_j(0)| = 1$. Then the preceding form of proof
is still satisfactory, but we must use the solution

$$y_n = hn[r_j(0)]^n \qquad 0 \le n \le N(h)$$

This completes the proof of the necessity of the root condition.


2. Assume that the root condition is satisfied. As with the previous
theorem, it is too difficult to give a general proof of convergence for an
arbitrary differential equation. For that, see the development in Isaacson
and Keller (1966, pp. 405-417). The present proof is restricted to the
model equation (6.8.13), and again we assume the roots $r_j(0)$ are distinct,
in order to simplify the proof.

The multistep method (6.8.1) becomes (6.8.18) for the model equation
$y' = \lambda y$, y(0) = 1. We show that the term $\gamma_0[r_0(h\lambda)]^n$ in its general
solution

$$y_n = \sum_{j=0}^{p}\gamma_j[r_j(h\lambda)]^n \qquad (6.8.35)$$

will converge to the solution $Y(x) = e^{\lambda x}$ on [0, b]. The remaining terms
$\gamma_j[r_j(h\lambda)]^n$, j = 1, 2, ..., p, are parasitic solutions, and they can be
shown to converge to zero as $h \to 0$ (see Problem 30).

Expand $r_0(h\lambda)$ using Taylor's theorem,

$$r_0(h\lambda) = r_0(0) + h\lambda\,r_0'(0) + O(h^2)$$

From (6.8.30),

$$r_0'(0) = \frac{\sigma(1)}{\rho'(1)}$$

and using consistency condition (6.8.7), this leads to $r_0'(0) = 1$. Then

$$r_0(h\lambda) = 1 + h\lambda + O(h^2) = e^{\lambda h} + O(h^2)$$

$$[r_0(h\lambda)]^n = e^{\lambda nh}[1 + O(h^2)]^n = e^{\lambda x_n}[1 + O(h)] \qquad (6.8.36)$$

over every finite interval $0 \le x_n \le b$. Thus

$$\max_{0 \le x_n \le b}\left|[r_0(h\lambda)]^n - e^{\lambda x_n}\right| \to 0 \quad \text{as} \quad h \to 0 \qquad (6.8.37)$$

We must now show that the coefficient $\gamma_0 \to 1$ as $h \to 0$.


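The coefficients $\gamma_0(h), \ldots, \gamma_p(h)$ satisfy the linear system

$$\gamma_0 + \gamma_1 + \cdots + \gamma_p = y_0$$

$$\gamma_0[r_0(h\lambda)] + \cdots + \gamma_p[r_p(h\lambda)] = y_1 \qquad (6.8.38)$$

$$\vdots$$

$$\gamma_0[r_0(h\lambda)]^p + \cdots + \gamma_p[r_p(h\lambda)]^p = y_p$$

The initial values $y_0, \ldots, y_p$ depend on h and are assumed to satisfy

$$\eta(h) \equiv \max_{0 \le n \le p}|e^{\lambda x_n} - y_n| \to 0 \quad \text{as} \quad h \to 0$$

But this implies

$$\lim_{h \to 0} y_n = 1 \qquad 0 \le n \le p \qquad (6.8.39)$$

The coefficient $\gamma_0$ can be obtained by using Cramer's rule to solve
(6.8.38):

$$\gamma_0 = \frac{\begin{vmatrix} y_0 & 1 & \cdots & 1 \\ y_1 & r_1 & \cdots & r_p \\ \vdots & \vdots & & \vdots \\ y_p & r_1^p & \cdots & r_p^p \end{vmatrix}}{\begin{vmatrix} 1 & 1 & \cdots & 1 \\ r_0 & r_1 & \cdots & r_p \\ \vdots & \vdots & & \vdots \\ r_0^p & r_1^p & \cdots & r_p^p \end{vmatrix}} \qquad (6.8.40)$$

The denominator converges to the Vandermonde determinant for $r_0(0) = 1$,
$r_1(0), \ldots, r_p(0)$, and this is nonzero since the roots are distinct (see
Problem 1 of Chapter 3). By using (6.8.39), the numerator converges to
the same quantity as $h \to 0$. Therefore, $\gamma_0 \to 1$ as $h \to 0$. Using this,
along with (6.8.37) and Problem 30, the solution $\{y_n\}$ converges to
$Y(x) = e^{\lambda x}$ on [0, b]. ∎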
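The behavior of the principal and parasitic roots in this proof can be seen numerically for a specific method. The sketch below is an illustration that is not part of the proof: it assumes the second-order Adams-Bashforth method (p = 1 in Table 6.11), for which $\rho(r) = r^2 - r$ and $\sigma(r) = (3r - 1)/2$, so that (6.8.21) is a quadratic. The function names are hypothetical.

```python
import math


def ab2_roots(hl):
    """Roots of rho(r) - hl*sigma(r) = r^2 - (1 + 1.5*hl) r + 0.5*hl = 0."""
    b = 1.0 + 1.5 * hl
    disc = math.sqrt(b * b - 2.0 * hl)
    return (b + disc) / 2.0, (b - disc) / 2.0    # principal root, parasitic root


def principal_term_error(lam, b_end, n_steps):
    """|[r_0(h*lam)]^N - exp(lam*b_end)| with h = b_end / N."""
    r0, _ = ab2_roots(lam * b_end / n_steps)
    return abs(r0 ** n_steps - math.exp(lam * b_end))
```

With $\lambda = -1$ and b = 1, increasing N from 50 to 200 reduces `principal_term_error` by roughly a factor of 16, while the parasitic root stays near $h\lambda/2$, so its powers die out almost immediately.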


The following is a well-known result; it is a trivial consequence of Theorems
6.7 and 6.8.


Corollary  Let (6.8.1) be a consistent multistep method. Then it is convergent if
and only if it is stable. ∎


Relative stability and weak stability  Consider again the model equation (6.8.13)
and its numerical solution (6.8.22). The preceding theorem stated that the parasitic
solutions $\gamma_j[r_j(h\lambda)]^n$ will converge to zero as $h \to 0$. But for a fixed h with
increasing $x_n$, we also would like them to remain small relative to the principal
part of the solution $\gamma_0[r_0(h\lambda)]^n$. This will be true if the characteristic roots satisfy

$$|r_j(h\lambda)| \le r_0(h\lambda) \qquad j = 1, 2, \ldots, p \qquad (6.8.41)$$

for all sufficiently small values of h. This leads us to the definition of relative
stability.

We say that the method (6.8.1) is relatively stable if the characteristic roots
$r_j(h\lambda)$ satisfy (6.8.41) for all sufficiently small nonzero values of $|h\lambda|$. And the
method is said to satisfy the strong root condition if

$$|r_j(0)| < 1 \qquad j = 1, 2, \ldots, p \qquad (6.8.42)$$

This is an easy condition to check, and it implies relative stability. Just use the
continuity of the roots $r_j(h\lambda)$ with respect to $h\lambda$ to have (6.8.42) imply (6.8.41).
Relative stability does not imply the strong root condition, although they are
equivalent for most methods [see Problem 36(b)]. If a multistep method is stable
but not relatively stable, then it will be called weakly stable.


Example 1. For the midpoint method,

$$ r_0(h\lambda) = 1 + h\lambda + O(h^2) \qquad r_1(h\lambda) = -1 + h\lambda + O(h^2) \tag{6.8.43} $$

It is weakly stable according to (6.8.41) when $\lambda < 0$, which agrees with what was
shown earlier in Section 6.4.
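
As a numerical illustration of (6.8.43), the following Python sketch computes both characteristic roots of the midpoint method for real $h\lambda$. It assumes the midpoint method in the form $y_{n+1} = y_{n-1} + 2hf(x_n, y_n)$, so that the characteristic equation for $y' = \lambda y$ is $r^2 - 2h\lambda r - 1 = 0$; the function name `midpoint_roots` is illustrative only.

```python
import math

def midpoint_roots(h_lambda):
    """Roots of r**2 - 2*(h*lambda)*r - 1 = 0 for real h*lambda."""
    root = math.sqrt(1.0 + h_lambda ** 2)
    r0 = h_lambda + root   # principal root, approximately 1 + h*lambda
    r1 = h_lambda - root   # parasitic root, approximately -1 + h*lambda
    return r0, r1

for h_lambda in (-0.1, 0.1):
    r0, r1 = midpoint_roots(h_lambda)
    relation = "<=" if abs(r1) <= r0 else ">"
    print(f"h*lambda = {h_lambda:+.1f}: r0 = {r0:.6f}, "
          f"|r1| = {abs(r1):.6f}, so |r1| {relation} r0")
```

For negative $h\lambda$ the parasitic root is the larger one in magnitude, which is the failure of (6.8.41) described above.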


2. The Adams-Bashforth and Adams-Moulton methods, (6.7.22) and (6.7.26),
have the same characteristic polynomial when $h = 0$,

$$ \rho(r) = r^{p+1} - r^p \tag{6.8.44} $$

The roots are $r_0 = 1$, $r_j = 0$, $j = 1, 2, \ldots, p$; thus the strong root condition is
satisfied and the Adams methods are relatively stable.


Stability regions In the preceding discussions of stability, the values of $h$ were
required to be sufficiently small in order to carry through the derivations. Little
indication was given as to just how small $h\lambda$ should be. It is clear that if $h$ is
required to be extremely small, then the method is impractical for most problems;
thus we need to examine the permissible values of $h$. Since the stability
depends on the characteristic roots, and since they in turn depend on $h\lambda$, we are
interested in determining the values of $h\lambda$ for which the multistep method (6.8.1)
is stable in some sense. To cover situations arising when solving systems of
differential equations, it is necessary that the value of $\lambda$ be allowed to be
complex, as noted following (6.8.17).

To motivate the later discussion, we consider the stability of Euler's method.
Apply Euler's method to the equation

$$ y' = \lambda y + g(x) \qquad y(0) = Y_0 \tag{6.8.45} $$

obtaining

$$ y_{n+1} = y_n + h[\lambda y_n + g(x_n)] \qquad n \ge 0 \qquad y_0 = Y_0 \tag{6.8.46} $$

Then consider the perturbed problem

$$ z_{n+1} = z_n + h[\lambda z_n + g(x_n)] \qquad n \ge 0 \qquad z_0 = Y_0 + \epsilon \tag{6.8.47} $$

For the original Eq. (6.8.45), this perturbation of $Y_0$ leads to solutions $Y(x)$ and
$Z(x)$ satisfying

$$ Z(x) - Y(x) = \epsilon e^{\lambda x} \qquad x \ge 0 $$

In this original problem, we would ordinarily be interested in the case with
$\text{Real}(\lambda) \le 0$, since then $|Y(x) - Z(x)|$ would remain bounded as $x \to \infty$. We
further restrict our interest to the case of $\text{Real}(\lambda) < 0$, so that $Y(x) - Z(x) \to 0$
as $x \to \infty$. For such $\lambda$, we want to find the values of $h$ so that the numerical
solutions of (6.8.46) and (6.8.47) will retain the behavior associated with $Y(x)$
and $Z(x)$.

Let $e_n = z_n - y_n$. Subtracting (6.8.46) from (6.8.47),

$$ e_{n+1} = e_n + h\lambda e_n = (1 + h\lambda) e_n \qquad e_0 = \epsilon $$

Inductively,

$$ e_n = (1 + h\lambda)^n \epsilon \tag{6.8.48} $$

Then $e_n \to 0$ as $x_n \to \infty$ if and only if

$$ |1 + h\lambda| < 1 \tag{6.8.49} $$

This yields the set of complex values $h\lambda$ lying inside the circle of radius 1 about the
point $-1$ in the complex plane. If $h\lambda$ belongs to this set, then $y_n - z_n \to 0$ as
$x_n \to \infty$, but not otherwise.

To see that this discussion is also important for convergence, realize that the
original differential equation can be looked upon as a perturbation of the
approximating equation (6.8.46). From (6.2.17), applied to (6.8.45),

$$ Y_{n+1} = Y_n + h[\lambda Y_n + g(x_n)] + \frac{h^2}{2} Y''(\xi_n) \tag{6.8.50} $$

Here we have a perturbation of the equation (6.8.46) at every step, not at just the
initial point $x_0 = 0$. Nonetheless, the preceding stability analysis can be shown to
apply to this perturbation of (6.8.46). The error formula (6.8.48) will have to be
suitably modified, but it will still depend critically on the bound (6.8.49) (see
Problem 40).


Example Apply Euler's method to the problem

$$ y' = \lambda y + (1 - \lambda)\cos(x) - (1 + \lambda)\sin(x) \qquad y(0) = 1 \tag{6.8.51} $$

whose true solution is $Y(x) = \sin(x) + \cos(x)$. We give results for several
values of $\lambda$ and $h$. For $\lambda = -1, -10, -50$, the bound (6.8.49) implies the
respective bounds on $h$ of

$$ 0 < h < 2 \qquad 0 < h < \tfrac{1}{5} = .2 \qquad 0 < h < \tfrac{1}{25} = .04 $$

The use of larger values of $h$ gives poor numerical results, as seen in Table 6.17.

Table 6.17 Euler's method for (6.8.51)

  λ      x    Error: h = .5    Error: h = .1    Error: h = .01
  -1     1    -2.46E-1         -4.32E-2         -4.22E-3
         2    -2.55E-1         -4.64E-2         -4.55E-3
         3    -2.66E-2         -6.78E-3         -7.22E-4
         4     2.27E-1          3.91E-2          3.78E-3
         5     2.72E-1          4.91E-2          4.81E-3
  -10    1     3.98E-1         -6.99E-3         -6.99E-4
         2     6.90E+0         -2.90E-3         -3.08E-4
         3     1.11E+2          3.86E-3          3.64E-4
         4     1.77E+3          7.07E-3          7.04E-4
         5     2.83E+4          3.78E-3          3.97E-4
  -50    1     3.26E+0          1.06E+3         -1.39E-4
         2     1.88E+3          1.11E+9         -5.16E-5
         3     1.08E+6          1.17E+15         8.25E-5
         4     6.24E+8          1.23E+21         1.41E-4
         5     3.59E+11         1.28E+27         7.00E-5
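
The experiment behind Table 6.17 can be sketched in a few lines of Python. The code below is an illustrative reimplementation, not the program used for the table; the function names are hypothetical, and the error is taken to be $Y(x_n) - y_n$, which is an assumption about the sign convention of the table.

```python
import math

def g(x, lam):
    """Forcing term of (6.8.51)."""
    return (1.0 - lam) * math.cos(x) - (1.0 + lam) * math.sin(x)

def euler_error(lam, h, x_end):
    """Return Y(x_end) - y_n for Euler's method applied to (6.8.51)."""
    steps = round(x_end / h)
    y = 1.0
    for n in range(steps):
        x = n * h
        y = y + h * (lam * y + g(x, lam))
    x_final = steps * h
    return math.sin(x_final) + math.cos(x_final) - y

for lam in (-1.0, -10.0, -50.0):
    for h in (0.5, 0.1, 0.01):
        stable = abs(1.0 + h * lam) < 1.0   # the bound (6.8.49)
        print(f"lambda = {lam:6.1f}, h = {h:5.2f}, |1 + h*lambda| < 1: {stable}, "
              f"error at x = 5: {euler_error(lam, h, 5.0):.2e}")
```

The combinations for which the bound (6.8.49) fails are the ones whose errors grow with $x$.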


The preceding derivation with Euler's method motivates our general approach
to finding the set of all $h\lambda$ for which the method (6.8.1) is stable. Since we
consider only cases with $\text{Real}(\lambda) < 0$, we want the numerical solution $\{y_n\}$ of
(6.8.1), when applied to the model equation $y' = \lambda y$, to tend to zero as $x_n \to \infty$,
for all choices of initial values $y_0, y_1, \ldots, y_p$. The set of all $h\lambda$ for which this is
true is called the region of absolute stability of the method (6.8.1). The larger this
region, the less the restriction on $h$ in order to have a stable numerical solution.

When (6.8.1) is applied to the model equation, we obtain the earlier equation
(6.8.18), and its solution is given by (6.8.22), namely

$$ y_n = \sum_{j=0}^{p} \gamma_j [r_j(h\lambda)]^n \qquad n \ge 0 $$

provided the characteristic roots $r_0(h\lambda), \ldots, r_p(h\lambda)$ are distinct. To have this
tend to zero as $n \to \infty$, for all choices of $y_0, \ldots, y_p$, it is necessary and sufficient
to have

$$ |r_j(h\lambda)| < 1 \qquad j = 0, 1, \ldots, p \tag{6.8.52} $$

The set of all $h\lambda$ satisfying this set of inequalities is also called the region of
absolute stability. This region is contained in the set defined in the preceding
paragraph, and it is usually equal to that set. We work only with (6.8.52) in
finding the region of absolute stability.


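Example Consider the second-order Adams-Bashforth method

$$ y_{n+1} = y_n + \frac{h}{2}\left[3y_n' - y_{n-1}'\right] \qquad n \ge 1 \tag{6.8.53} $$

The characteristic equation is

$$ r^2 - \left(1 + \tfrac{3}{2}h\lambda\right) r + \tfrac{1}{2}h\lambda = 0 $$

and its roots are

$$ r_0 = \tfrac{1}{2}\left\{1 + \tfrac{3}{2}h\lambda + \sqrt{1 + h\lambda + \tfrac{9}{4}h^2\lambda^2}\right\} $$

$$ r_1 = \tfrac{1}{2}\left\{1 + \tfrac{3}{2}h\lambda - \sqrt{1 + h\lambda + \tfrac{9}{4}h^2\lambda^2}\right\} $$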

Figure 6.7 Stability regions for Adams-Bashforth methods (axes: Real, Imaginary).
The method of order k is stable inside the region indicated left of origin.
[Taken from Gear (1971), p. 131, with permission.]

Figure 6.8 Stability regions for Adams-Moulton methods (axes: Real, Imaginary).
The method of order k is stable inside the region indicated. [Taken from
Gear (1971), p. 131, with permission.]


The region of absolute stability is the set of $h\lambda$ for which

$$ |r_0(h\lambda)| < 1 \qquad |r_1(h\lambda)| < 1 $$

For $\lambda$ real, the acceptable values of $h\lambda$ are $-1 < h\lambda < 0$.
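
The interval $-1 < h\lambda < 0$ can be checked numerically from the root formulas above. The following Python sketch evaluates both roots with complex arithmetic, so it also accepts complex $h\lambda$; the function names are hypothetical and the sample points are arbitrary.

```python
import cmath

def ab2_roots(h_lambda):
    """Roots of r**2 - (1 + 1.5*z)*r + 0.5*z = 0, where z = h*lambda may be complex."""
    z = complex(h_lambda)
    disc = cmath.sqrt(1.0 + z + 2.25 * z * z)
    r0 = 0.5 * (1.0 + 1.5 * z + disc)
    r1 = 0.5 * (1.0 + 1.5 * z - disc)
    return r0, r1

def ab2_absolutely_stable(h_lambda):
    """True when both characteristic roots lie strictly inside the unit circle."""
    r0, r1 = ab2_roots(h_lambda)
    return abs(r0) < 1.0 and abs(r1) < 1.0

for z in (-1.2, -1.0, -0.9, -0.5, -0.1, 0.1):
    print(f"h*lambda = {z:+.1f}: absolutely stable = {ab2_absolutely_stable(z)}")
```

Sweeping `ab2_absolutely_stable` over a grid of complex values would trace out a region of the kind shown in Figure 6.7 for this method.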


The boundaries of the regions of absolute stability of the Adams-Bashforth
and the Adams-Moulton methods are given in Figures 6.7 and 6.8, respectively.
For Adams-Moulton formulas with one iteration of an Adams-Bashforth predictor,
the regions of absolute stability are given in Shampine and Gordon (1975,
pp. 135-140).

From these diagrams, it is clear that the region of absolute stability becomes
smaller as the order of the method increases. And for formulas of the same order,
the Adams-Moulton formula has a significantly larger region of absolute stability
than the Adams-Bashforth formula. The size of these regions is usually quite
acceptable from a practical point of view. For example, the real values of $h\lambda$ in
the region of absolute stability for the fourth-order Adams-Moulton formula are
given by $-3 < h\lambda < 0$. This is not a serious restriction on $h$ in most cases.

The Adams family of formulas is very convenient for creating a variable-order
algorithm, and their stability regions are quite acceptable. They will have
difficulty with problems for which $\lambda$ is negative and large in magnitude, and
these problems are best treated by other methods, which we consider in the next
section.

There are special methods for which the region of absolute stability consists of
all complex values $h\lambda$ with $\text{Real}(h\lambda) < 0$. These methods are called A-stable,
and with them there is no restriction on $h$ in order to have stability of the type
we have been considering. The trapezoidal rule is an example of such a method
[see (6.5.24)-(6.5.25)].

Table 6.18 Example of trapezoidal rule: h = .5

  x     Error: λ = -1    Error: λ = -10    Error: λ = -50
  2     -1.13E-2         -2.78E-3          -7.91E-4
  4     -1.43E-2         -8.91E-5          -8.91E-5
  6      2.02E-2          2.77E-3           4.72E-4
  8     -2.86E-3         -2.22E-3          -5.11E-4
  10    -1.79E-2         -9.23E-4          -1.56E-4


Example Consider the backward Euler method:

$$ y_{n+1} = y_n + h f(x_{n+1}, y_{n+1}) \qquad n \ge 0 \tag{6.8.54} $$

Applying this to the model equation $y' = \lambda y$ and solving, we have

$$ y_n = \left[\frac{1}{1 - h\lambda}\right]^n y_0 \qquad n \ge 0 \tag{6.8.55} $$

Then $y_n \to 0$ as $x_n \to \infty$ if and only if

$$ \left|\frac{1}{1 - h\lambda}\right| < 1 $$

This will be true for all $h\lambda$ with $\text{Real}(\lambda) < 0$, and the backward Euler method is
an A-stable method.


Example Apply the trapezoidal method to the problem (6.8.51), which was
solved earlier with Euler's method. We use a stepsize of $h = .5$ for
$\lambda = -1, -10, -50$. The results are given in Table 6.18. They illustrate that the
trapezoidal rule does not become unstable as $|\lambda|$ increases, while $\text{Real}(\lambda) < 0$.
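
A minimal Python sketch of this experiment follows. Because (6.8.51) is linear in $y$, the implicit trapezoidal equation can be solved for $y_{n+1}$ in closed form, so no iteration is needed here; that shortcut, the function names, and the sign convention $Y(x_n) - y_n$ for the error are assumptions of this sketch rather than details given in the text.

```python
import math

def g(x, lam):
    """Forcing term of (6.8.51)."""
    return (1.0 - lam) * math.cos(x) - (1.0 + lam) * math.sin(x)

def trapezoidal_error(lam, h, x_end):
    """Return Y(x_end) - y_n for the trapezoidal rule applied to (6.8.51)."""
    steps = round(x_end / h)
    y = 1.0
    for n in range(steps):
        x, x_next = n * h, (n + 1) * h
        # y_next = y + (h/2) * [lam*y + g(x) + lam*y_next + g(x_next)],
        # solved explicitly for y_next because the equation is linear.
        rhs = (1.0 + 0.5 * h * lam) * y + 0.5 * h * (g(x, lam) + g(x_next, lam))
        y = rhs / (1.0 - 0.5 * h * lam)
    x_final = steps * h
    return math.sin(x_final) + math.cos(x_final) - y

for lam in (-1.0, -10.0, -50.0):
    errors = [trapezoidal_error(lam, 0.5, x) for x in (2.0, 4.0, 6.0, 8.0, 10.0)]
    print(f"lambda = {lam:6.1f}:", "  ".join(f"{e: .2e}" for e in errors))
```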


It would be useful to have A-stable multistep methods of order greater than 2. 
But a result of Dahlquist (1963) shows there are no such methods. We examine 
some higher order methods that have most of the needed stability properties in 
the following section. 


6.9 Stiff Differential Equations and the Method of Lines 


The numerical solution of stiff differential equations has, within the past ten to 
fifteen years, become a much studied subject. Such equations (including systems 
of differential equations) have appeared in an increasing number of applications, 
in subjects as diverse as chemical kinetics and the numerical solution of partial 
differential equations. In this section, we sketch some of the main ideas about this 
subject, and we show its relation to the numerical solution of the simple heat
equation. 

There are many definitions of the concept of stiff differential equation. The
most important common feature of these definitions is that when such equations
are being solved with standard numerical methods (e.g., the Adams methods of
Section 6.7), the stepsize $h$ is forced to be extremely small in order to maintain
stability, far smaller than would appear to be necessary based on a consideration
of the truncation error. An indication of this can be seen with Eq. (6.8.51),
which was solved in Table 6.17 with Euler's method. In that case, the unknown
$Y(x)$ did not change with $\lambda$, and therefore the truncation error was also
independent of $\lambda$. But the actual error was strongly affected by the magnitude of
$\lambda$, with $h\lambda$ required to satisfy the stability condition $|1 + h\lambda| < 1$ to obtain
convergence. As $|\lambda|$ increased, the size of $h$ had to decrease accordingly. This is
typical of the behavior of standard numerical methods when applied to stiff
differential equations, with the major difference being that the actual values of $|\lambda|$
are far larger in real life examples, for example, $\lambda = -10^6$.

We now look at the most common class of such differential equations, basing
our examination on consideration of the linearization of the system $y' = f(x, y)$
as developed in (6.8.14)-(6.8.17):

$$ y' = Ay + g(x) \tag{6.9.1} $$

with $A = f_y(x_0, Y_0)$ the Jacobian matrix of $f$. We say the differential equation
$y' = f(x, y)$ is stiff if some of the eigenvalues $\lambda_j$ of $A$, or more generally of $f_y(x, y)$,
have a negative real part of very large magnitude. We study numerical methods
for stiff equations by considering their effect on the model equation

$$ y' = \lambda y + g(x) \tag{6.9.2} $$

with $\text{Real}(\lambda)$ negative and very large in magnitude. This approach has its
limitations, some of which we indicate later, but it does give us a means of
rejecting unsatisfactory methods, and it suggests some possibly satisfactory
methods.

The concept of a region of absolute stability, introduced in the last section, is 
the initial tool used in studying the stability of a numerical method for solving 
stiff differential equations. We seek methods whose stability region contains the 
entire negative real axis and as much of the left half of the complex plane as 
possible. There are a number of ways to develop such methods, but we only 
discuss one of them—obtaining the backward differentiation formulas (BDFs). 

Let $P_p(x)$ denote the polynomial of degree $\le p$ that interpolates $Y(x)$ at the
points $x_{n+1}, x_n, \ldots, x_{n-p+1}$ for some $p \ge 1$:

$$ P_p(x) = \sum_{j=-1}^{p-1} Y(x_{n-j})\, l_{j,n}(x) \tag{6.9.3} $$

with $\{l_{j,n}(x)\}$ the Lagrange interpolation basis functions for the nodes
$x_{n+1}, \ldots, x_{n-p+1}$ [see (3.1.5)]. Use

$$ P_p'(x_{n+1}) \approx Y'(x_{n+1}) = f(x_{n+1}, Y(x_{n+1})) \tag{6.9.4} $$

Combining with (6.9.3) and solving for $Y(x_{n+1})$,

$$ Y(x_{n+1}) \approx \sum_{j=0}^{p-1} \alpha_j Y(x_{n-j}) + h\beta f(x_{n+1}, Y(x_{n+1})) \tag{6.9.5} $$

The p-step BDF method is given by

$$ y_{n+1} = \sum_{j=0}^{p-1} \alpha_j y_{n-j} + h\beta f(x_{n+1}, y_{n+1}) \tag{6.9.6} $$

Table 6.19 Coefficients of BDF method (6.9.6)

  p    β         α_0        α_1         α_2        α_3         α_4       α_5
  1    1         1
  2    2/3       4/3        -1/3
  3    6/11      18/11      -9/11       2/11
  4    12/25     48/25      -36/25      16/25      -3/25
  5    60/137    300/137    -300/137    200/137    -75/137     12/137
  6    60/147    360/147    -450/147    400/147    -225/147    72/147    -10/147


The coefficients for the cases of $p = 1, \ldots, 6$ are given in Table 6.19. The case
$p = 1$ is simply the backward Euler method, which was discussed following
(6.8.54) in the last section. The truncation error for (6.9.6) can be obtained from
the error formula for numerical differentiation, given in (5.7.5):

$$ T_n(Y) = -\frac{\beta}{p+1}\, h^{p+1}\, Y^{(p+1)}(\xi_n) \tag{6.9.7} $$

for some $x_{n-p+1} \le \xi_n \le x_{n+1}$.
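
The construction (6.9.3)-(6.9.6) can be carried out in exact rational arithmetic. The Python sketch below differentiates the Lagrange basis functions at $x_{n+1}$ (with the nodes shifted and scaled so that $x_{n+1} = 0$ and $h = 1$) and solves for the coefficients $\beta$ and $\alpha_j$ of (6.9.6). The function names are hypothetical, and the scaling is a convenience of this sketch, not part of the derivation in the text.

```python
from fractions import Fraction

def lagrange_derivative_at_zero(nodes, i):
    """Derivative at t = 0 of the i-th Lagrange basis polynomial on `nodes`."""
    total = Fraction(0)
    for m, t_m in enumerate(nodes):
        if m == i:
            continue
        term = Fraction(1, nodes[i] - t_m)
        for k, t_k in enumerate(nodes):
            if k != i and k != m:
                term *= Fraction(0 - t_k, nodes[i] - t_k)
        total += term
    return total

def bdf_coefficients(p):
    """Return (beta, [alpha_0, ..., alpha_{p-1}]) for the p-step BDF method."""
    # Nodes x_{n+1}, x_n, ..., x_{n-p+1} become 0, -1, ..., -p.
    nodes = [-k for k in range(p + 1)]
    c = [lagrange_derivative_at_zero(nodes, i) for i in range(p + 1)]
    # P'(x_{n+1}) = (1/h) * sum_i c[i] * Y(x_{n+1-i}) = f(x_{n+1}, Y(x_{n+1})),
    # solved for Y(x_{n+1}).
    beta = 1 / c[0]
    alphas = [-c[j + 1] / c[0] for j in range(p)]
    return beta, alphas

for p in range(1, 7):
    beta, alphas = bdf_coefficients(p)
    print(p, beta, [str(a) for a in alphas])
```

`Fraction` reduces to lowest terms, so the $p = 6$ row prints $\beta = 20/49$, which equals the tabulated $60/147$.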

The regions of absolute stability for the formulas of Table 6.19 are given in
Gear (1971, pp. 215-216). To create these regions, we must find all values $h\lambda$ for
which

$$ |r_j(h\lambda)| < 1 \qquad j = 0, 1, \ldots, p \tag{6.9.8} $$

where the characteristic roots $r_j(h\lambda)$ are the solutions of

$$ r^p = \sum_{j=0}^{p-1} \alpha_j r^{p-1-j} + h\lambda\beta r^p \tag{6.9.9} $$

It can be shown that for $p = 1$ and $p = 2$, the BDFs are A-stable, and that for
$3 \le p \le 6$, the region of absolute stability becomes smaller as $p$ increases,
although containing the entire negative real axis in each case. For $p \ge 7$, the
regions of absolute stability are not acceptable for the solution of stiff problems.
For more discussion of these stability regions, see Gear (1971, chap. 11) and
Lambert (1973, chap. 8).

There are still problems with the BDF methods and with other methods that
are chosen solely on the basis of their region of absolute stability. First, with the
model equation $y' = \lambda y$, if $\text{Real}(\lambda)$ is of large magnitude and negative, then the
solution $Y(x)$ goes to zero very rapidly, and as $|\text{Real}(\lambda)|$ increases,
the convergence to zero of $Y(x)$ becomes more rapid. We would like the same
behavior to hold for the numerical solution $\{y_n\}$ of the model equation. But with
the A-stable trapezoidal rule, the solution [from (6.5.24)] is

$$ y_n = \left[\frac{1 + \frac{h\lambda}{2}}{1 - \frac{h\lambda}{2}}\right]^n y_0 \qquad n \ge 0 $$

If $|\text{Real}(\lambda)|$ is large, then the fraction inside the brackets is near to $-1$, and $y_n$
decreases to 0 quite slowly. Referring to the type of argument used with the Euler
method in (6.8.45)-(6.8.50), the effect of perturbations will not decrease rapidly
to zero for larger values of $|\lambda|$. Thus the trapezoidal method may not be a
completely satisfactory choice for stiff problems. In comparison the A-stable
backward Euler method has the desired behavior. From (6.8.55), the solution of
the model problem is

$$ y_n = \left[\frac{1}{1 - h\lambda}\right]^n y_0 \qquad n \ge 0 $$

As $|\lambda|$ increases, the sequence $\{y_n\}$ goes to zero more rapidly. Thus the
backward Euler solution better reflects the behavior of the true solution of the
model equation.
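
The contrast between the two methods is easy to see by tabulating the per-step factors in the two formulas above. The short Python sketch below does this for real negative $\lambda$; the stepsize $h = 0.1$ and the sample values of $\lambda$ are arbitrary choices made for illustration.

```python
def trapezoidal_factor(h_lambda):
    """Per-step factor (1 + h*lambda/2) / (1 - h*lambda/2) of the trapezoidal rule."""
    return (1.0 + 0.5 * h_lambda) / (1.0 - 0.5 * h_lambda)

def backward_euler_factor(h_lambda):
    """Per-step factor 1 / (1 - h*lambda) of the backward Euler method."""
    return 1.0 / (1.0 - h_lambda)

h = 0.1
for lam in (-1.0, -1.0e2, -1.0e4, -1.0e6):
    z = h * lam
    print(f"lambda = {lam:9.1e}: trapezoidal {trapezoidal_factor(z):+.6f}, "
          f"backward Euler {backward_euler_factor(z):+.3e}")
```

As $|\lambda|$ grows, the trapezoidal factor approaches $-1$ while the backward Euler factor approaches 0.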

A second problem with the use of stability regions is that it is based on using
constant $\lambda$ and linear problems. The linearization (6.9.1) is often valid, but not
always. For example, consider the second-order linear problem

$$ y'' + a y' + (1 + b \cos(2\pi x))\, y = g(x) \qquad x \ge 0 \tag{6.9.10} $$

in which one coefficient is not constant. Convert it to the equivalent system

$$
\begin{aligned}
y_1' &= y_2 \\
y_2' &= -(1 + b \cos(2\pi x))\, y_1 - a y_2 + g(x)
\end{aligned}
\tag{6.9.11}
$$

We will assume $a > 0$, $|b| < 1$. The eigenvalues of the homogeneous equation
[$g(x) \equiv 0$] are

$$ \lambda = \frac{-a \pm \sqrt{a^2 - 4[1 + b \cos(2\pi x)]}}{2} \tag{6.9.12} $$

These are negative real numbers or are complex numbers with negative real parts.
On the basis of the stability theory for the constant coefficient (or constant $\lambda$)
case, we would assume that the effect of all perturbations in the initial data
would die away as $x \to \infty$. But in fact, the homogeneous part of (6.9.10) will
have unbounded solutions. Thus there will be perturbations of the initial values
that will lead to unbounded perturbed solutions in (6.9.10). This calls into
question the validity of the use of the model equation $y' = \lambda y + g(x)$. Its use
suggests methods that we may want to study further, but by itself, this approach
is not sufficient to encompass the vast variety of linear and nonlinear problems.
The example (6.9.10) is taken from Aiken (1985, p. 269).


Solving the finite difference method We illustrate the problem by considering
the backward Euler method:

$$ y_{n+1} = y_n + h f(x_{n+1}, y_{n+1}) \qquad n \ge 0 \tag{6.9.13} $$

If the ordinary iteration formula

$$ y_{n+1}^{(j+1)} = y_n + h f(x_{n+1}, y_{n+1}^{(j)}) \qquad j \ge 0 \tag{6.9.14} $$

is used, then

$$ y_{n+1} - y_{n+1}^{(j+1)} \approx h\, \frac{\partial f(x_{n+1}, y_{n+1})}{\partial y} \left[ y_{n+1} - y_{n+1}^{(j)} \right] $$

For convergence, we would need to have

$$ \left| h\, \frac{\partial f(x_{n+1}, y_{n+1})}{\partial y} \right| < 1 \tag{6.9.15} $$

But with stiff equations, this would again force $h$ to be very small, which we are
trying to avoid. Thus another rootfinding method must be used to solve for $y_{n+1}$
in (6.9.13).

The most popular methods for solving (6.9.13) are based on Newton's method.
For a single differential equation, Newton's method for finding $y_{n+1}$ is

$$ y_{n+1}^{(j+1)} = y_{n+1}^{(j)} - \left[1 - h f_y(x_{n+1}, y_{n+1}^{(j)})\right]^{-1} \left[ y_{n+1}^{(j)} - y_n - h f(x_{n+1}, y_{n+1}^{(j)}) \right] \tag{6.9.16} $$

for $j \ge 0$. A crude initial guess is $y_{n+1}^{(0)} = y_n$, although generally this can be
improved upon. With systems of differential equations Newton's method becomes
very expensive. To decrease the expense, the matrix

$$ I - h f_y(x_{n+1}, z) \qquad \text{some } z \approx y_n \tag{6.9.17} $$

is used for all $j$ and for a number of successive values of $n$. Thus Newton's
method [see Section 2.11] for solving the system version of (6.9.13) is approximated by

$$
\begin{aligned}
\left[I - h f_y(x_{n+1}, z)\right] \delta^{(j)} &= y_n - y_{n+1}^{(j)} + h f(x_{n+1}, y_{n+1}^{(j)}) \\
y_{n+1}^{(j+1)} &= y_{n+1}^{(j)} + \delta^{(j)}
\end{aligned}
\tag{6.9.18}
$$

for $j \ge 0$. This amounts to solving a number of linear systems with the same
coefficient matrix. This can be done much more cheaply than when the matrix is
being modified (see the material in Section 8.1). The matrix in (6.9.17) will have
to be updated periodically, but the savings will still be very significant when
compared to an exact Newton's method. For a further discussion of this topic,
see Aiken (1985, p. 7) and Gupta et al. (1985, pp. 22-25). For a survey of
computer codes for solving stiff differential equations, see Aiken (1985, chap. 4).
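
To make the contrast between (6.9.14) and (6.9.16) concrete, the following Python sketch performs one backward Euler step with both iterations. The test problem $y' = -50y^2$, $y(0) = 1$, and the stepsize $h = 0.1$ are hypothetical choices made only for this illustration (they do not come from the text); they give $|h\, \partial f / \partial y| = 10$ at the starting value, so (6.9.15) is violated.

```python
import math

def f(x, y):
    """Hypothetical nonlinear test problem y' = -50*y**2 (not from the text)."""
    return -50.0 * y * y

def f_y(x, y):
    """Partial derivative of f with respect to y."""
    return -100.0 * y

def fixed_point_iterates(x_next, y_n, h, iterations):
    """Iterates of (6.9.14), starting from the crude guess y_n."""
    y = y_n
    history = [y]
    for _ in range(iterations):
        y = y_n + h * f(x_next, y)
        history.append(y)
    return history

def newton_solve(x_next, y_n, h, tol=1e-14, max_iter=50):
    """Newton's method (6.9.16) for the backward Euler equation (6.9.13)."""
    y = y_n
    for _ in range(max_iter):
        residual = y - y_n - h * f(x_next, y)
        step = residual / (1.0 - h * f_y(x_next, y))
        y -= step
        if abs(step) <= tol * max(1.0, abs(y)):
            break
    return y

h, y0 = 0.1, 1.0
print("fixed-point iterates:", fixed_point_iterates(h, y0, h, 4))
print("Newton solution:     ", newton_solve(h, y0, h))
```

For this scalar problem the backward Euler equation is the quadratic $5y^2 + y - 1 = 0$, so the Newton result can be compared with its positive root $(-1 + \sqrt{21})/10$.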


The method of lines Consider the following parabolic partial differential equation
problem:

$$ U_t = U_{xx} + G(x, t) \qquad 0 < x < 1 \qquad t > 0 \tag{6.9.19} $$

$$ U(0, t) = d_0(t) \qquad U(1, t) = d_1(t) \qquad t \ge 0 \tag{6.9.20} $$

$$ U(x, 0) = f(x) \qquad 0 \le x \le 1 \tag{6.9.21} $$

The notation $U_t$ and $U_{xx}$ refers to partial derivatives with respect to $t$ and $x$,
respectively. The unknown function $U(x, t)$ depends on the time $t$ and a spatial
variable $x$. The conditions (6.9.20) are called boundary conditions, and (6.9.21) is
called an initial condition. The solution $U$ can be interpreted as the temperature
of an insulated rod of length 1, with $U(x, t)$ the temperature at position $x$ and
time $t$; thus (6.9.19) is often called the heat equation. The functions $G$, $d_0$, $d_1$,
and $f$ are assumed given and smooth. For a development of the theory of
(6.9.19)-(6.9.21), see Widder (1975) or any standard introduction to partial
differential equations. We give the method of lines for solving for $U$, a numerical
method that has become much more popular in the past ten to fifteen years. It
will also lead to the solution of a stiff system of ordinary differential equations.
Let $m > 0$ be an integer, and define $\delta = 1/m$,

$$ x_j = j\delta \qquad j = 0, 1, \ldots, m $$

We discretize Eq. (6.9.19) by approximating the spatial derivative. Recall the
formulas (5.7.17) and (5.7.18) for approximating second derivatives. Using this,

$$ U_{xx}(x_j, t) = \frac{U(x_{j+1}, t) - 2U(x_j, t) + U(x_{j-1}, t)}{\delta^2} - \frac{\delta^2}{12} \frac{\partial^4 U(\xi_j, t)}{\partial x^4} $$

for $j = 1, 2, \ldots, m - 1$. Substituting into (6.9.19),

$$ U_t(x_j, t) = \frac{U(x_{j+1}, t) - 2U(x_j, t) + U(x_{j-1}, t)}{\delta^2} + G(x_j, t) - \frac{\delta^2}{12} \frac{\partial^4 U(\xi_j, t)}{\partial x^4} \qquad 1 \le j \le m - 1 \tag{6.9.22} $$
Equation (6.9.19) is to be approximated at each interior node point $x_j$. The
unknown $\xi_j \in [x_{j-1}, x_{j+1}]$.

Drop the final term in (6.9.22), the truncation error in the numerical differentiation.
Forcing equality in the resulting approximate equation, we obtain

$$ u_j'(t) = \frac{1}{\delta^2}\left[u_{j+1}(t) - 2u_j(t) + u_{j-1}(t)\right] + G(x_j, t) \tag{6.9.23} $$

for $j = 1, 2, \ldots, m - 1$. The functions $u_j(t)$ are intended to be approximations
of $U(x_j, t)$, $1 \le j \le m - 1$. This is the method of lines approximation to (6.9.19),
and it is a system of $m - 1$ ordinary differential equations. Note that $u_0(t)$ and
$u_m(t)$, which are needed in (6.9.23) for $j = 1$ and $j = m - 1$, are given using
(6.9.20):

$$ u_0(t) = d_0(t) \qquad u_m(t) = d_1(t) \tag{6.9.24} $$

The initial condition for (6.9.23) is given by (6.9.21):

$$ u_j(0) = f(x_j) \qquad 1 \le j \le m - 1 \tag{6.9.25} $$

The name method of lines comes from solving for $U(x, t)$ along the lines $(x_j, t)$,
$t \ge 0$, $1 \le j \le m - 1$, in the $(x, t)$-plane.

Under suitable smoothness assumptions on the functions $d_0$, $d_1$, $G$, and $f$, it
can be shown that

$$ \max_{\substack{0 \le j \le m \\ 0 \le t \le T}} |U(x_j, t) - u_j(t)| \le C_T\, \delta^2 \tag{6.9.26} $$

Thus to complete the solution process, we need only solve the system (6.9.23).
It will be convenient to write (6.9.23) in matrix form. Introduce

$$ u(t) = [u_1(t), \ldots, u_{m-1}(t)]^T \qquad u_0 = [f(x_1), \ldots, f(x_{m-1})]^T $$

$$ g(t) = \left[\frac{1}{\delta^2} d_0(t) + G(x_1, t),\; G(x_2, t), \ldots, G(x_{m-2}, t),\; \frac{1}{\delta^2} d_1(t) + G(x_{m-1}, t)\right]^T $$

$$
\Lambda = \frac{1}{\delta^2}
\begin{bmatrix}
-2 & 1 & 0 & \cdots & 0 \\
1 & -2 & 1 & & \vdots \\
 & \ddots & \ddots & \ddots & \\
\vdots & & 1 & -2 & 1 \\
0 & \cdots & 0 & 1 & -2
\end{bmatrix}
\tag{6.9.27}
$$

The matrix $\Lambda$ is of order $m - 1$. In the definitions of $u$ and $g$, the superscript $T$
indicates matrix transpose, so that $u$ and $g$ are column vectors of length $m - 1$.
Using these matrices, Eqs. (6.9.23)-(6.9.25) can be rewritten as

$$ u'(t) = \Lambda u(t) + g(t) \qquad u(0) = u_0 \tag{6.9.28} $$

If Euler's method is applied, we have the numerical method

$$ V_{n+1} = V_n + h[\Lambda V_n + g(t_n)] \qquad V_0 = u_0 \tag{6.9.29} $$

with $t_n = nh$ and $V_n \approx u(t_n)$. This is a well-known numerical method for the heat
equation, called the simple explicit method. We analyze the stability of (6.9.29)
and some other methods for solving (6.9.28).

Equation (6.9.28) is in the form of the model equation, (6.9.1), and therefore
we need the eigenvalues of $\Lambda$ to examine the stiffness of the system. It can be
shown that these eigenvalues are all real and are given by

$$ \lambda_j = -\frac{4}{\delta^2} \sin^2\left(\frac{j\pi}{2m}\right) \qquad 1 \le j \le m - 1 \tag{6.9.30} $$

We leave the proof of this to Problem 6 in Chapter 7. Directly examining this
formula,

$$ \lambda_{m-1} \le \lambda_j \le \lambda_1 \tag{6.9.31} $$

$$ \lambda_{m-1} = -\frac{4}{\delta^2} \sin^2\left(\frac{(m-1)\pi}{2m}\right) \approx -\frac{4}{\delta^2} $$

$$ \lambda_1 = -\frac{4}{\delta^2} \sin^2\left(\frac{\pi}{2m}\right) \approx -\pi^2 $$

with the approximations valid for larger $m$. It can be seen that (6.9.28) is a stiff
system if $\delta$ is small.

Applying (6.9.31) and (6.8.49) to the analysis of stability in (6.9.29), we must
have

$$ |1 + h\lambda_j| < 1 \qquad j = 1, \ldots, m - 1 $$

Using (6.9.30), this leads to the equivalent statement

$$ 0 < \frac{4h}{\delta^2} \sin^2\left(\frac{j\pi}{2m}\right) < 2 \qquad 1 \le j \le m - 1 $$

This will be satisfied if $4h/\delta^2 \le 2$, or

$$ h \le \tfrac{1}{2}\delta^2 \tag{6.9.32} $$

If $\delta$ is at all small, say $\delta = .01$, then the time-step $h$ must be quite small to have
stability.
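
The eigenvalue formula (6.9.30) and the resulting stepsize restriction can be checked numerically. The Python sketch below applies the matrix of (6.9.27) to the vectors with components $\sin(j\pi k/m)$, $k = 1, \ldots, m-1$, which are assumed here to be the corresponding eigenvectors (the text defers the proof to a problem), and then tests the condition $|1 + h\lambda_j| < 1$ for a given $h$. All function names are illustrative.

```python
import math

def eigenvalue(j, m):
    """The eigenvalue lambda_j of (6.9.30) for delta = 1/m."""
    delta = 1.0 / m
    return -4.0 / delta ** 2 * math.sin(j * math.pi / (2 * m)) ** 2

def apply_lambda(v, m):
    """Multiply the (m-1)-vector v by the tridiagonal matrix of (6.9.27)."""
    delta = 1.0 / m
    padded = [0.0] + list(v) + [0.0]
    return [(padded[k - 1] - 2.0 * padded[k] + padded[k + 1]) / delta ** 2
            for k in range(1, m)]

def euler_step_is_stable(h, m):
    """Check |1 + h*lambda_j| < 1 for every eigenvalue of the matrix."""
    return all(abs(1.0 + h * eigenvalue(j, m)) < 1.0 for j in range(1, m))

m = 100                      # delta = .01
delta = 1.0 / m
print("lambda_1     =", eigenvalue(1, m))
print("lambda_(m-1) =", eigenvalue(m - 1, m))
print("h = delta^2/2 stable:    ", euler_step_is_stable(0.5 * delta ** 2, m))
print("h = 0.6 * delta^2 stable:", euler_step_is_stable(0.6 * delta ** 2, m))
```

For $\delta = .01$ the bound (6.9.32) gives $h \le \tfrac{1}{2}(.01)^2 = 5 \times 10^{-5}$.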
In contrast to the restriction (6.9.32) with Euler's method, the backward Euler
method has no such restriction since it is A-stable. The method becomes

$$ V_{n+1} = V_n + h[\Lambda V_{n+1} + g(t_{n+1})] \qquad V_0 = u_0 \tag{6.9.33} $$

To solve this linear problem for $V_{n+1}$,

$$ (I - h\Lambda) V_{n+1} = V_n + h\, g(t_{n+1}) \tag{6.9.34} $$

This is a tridiagonal system of linear equations (see Section 8.3). It can be solved
very rapidly, with approximately $5m$ arithmetic operations per time step, excluding
the cost of computing the right side in (6.9.34). The cost of solving the Euler
method (6.9.29) is almost as great, and thus the solution of (6.9.34) is not
especially time-consuming.


Example Solve the partial differential equation problem (6.9.19)-(6.9.21) with
the functions $G$, $d_0$, $d_1$, and $f$ determined from the known solution


U=e""sin(ax) O<x<1 120 (6.9.35) 


Results for Euler’s method (6.9.29) are given in Table 6.20, and results for the 
backward Euler method (6.9.33) are given in Table 6.21. 

For Euler's method, we take $m = 4, 8, 16$, and to maintain stability, we take
$h = \delta^2/2$, from (6.9.32). Note this leads to the respective time steps of $h = .031$,
$.0078$, $.0020$. From (6.9.26) and the error formula for Euler's method, we would
expect the error to be proportional to $\delta^2$, since $h = \delta^2/2$. This implies that the
error should decrease by a factor of 4 when $m$ is doubled, and the results in
Table 6.20 agree. In the table, the column Error denotes the maximum error at
the node points $(x_j, t)$, $0 \le j \le m$, for the given value of $t$.

Table 6.20 The method of lines: Euler's method

  t      Error: m = 4    Ratio    Error: m = 8    Ratio    Error: m = 16
  1.0    3.89E-2         4.09     9.52E-3         4.02     2.37E-3
  2.0    3.19E-2         4.09     7.79E-3         4.02     1.94E-3
  3.0    2.61E-2         4.09     6.38E-3         4.01     1.59E-3
  4.0    2.14E-2         4.10     5.22E-3         4.02     1.30E-3
  5.0    1.75E-2         4.09     4.28E-3         4.04     1.06E-3

Table 6.21 The method of lines: backward Euler's method

  t      Error: m = 4    Error: m = 8    Error: m = 16
  1.0    4.45E-2         1.10E-2         2.86E-3
  2.0    3.65E-2         9.01E-3         2.34E-3
  3.0    2.99E-2         7.37E-3         1.92E-3
  4.0    2.45E-2         6.04E-3         1.57E-3
  5.0    2.00E-2         4.94E-3         1.29E-3

For the solution of (6.9.28) by the backward Euler method, there need no
longer be any connection between the space step $\delta$ and the time step $h$. By
observing the error formula (6.9.26) for the method of lines and the truncation
error formula (6.9.7) (use $p = 1$) for the backward Euler method, we see that the
error in solving the problem (6.9.19)-(6.9.21) will be proportional to $h + \delta^2$. For
the unknown function $U$ of (6.9.35), there is a slow variation with $t$. Thus for the
truncation error associated with the time integration, we should be able to use a
relatively large time step $h$ as compared to the space step $\delta$, in order to have the
two sources of error be relatively equal in size. In Table 6.21, we use $h = .1$ and
$m = 4, 8, 16$. Note that this time step is much larger than that used in Table 6.20
for Euler's method, and thus the backward Euler method is much more efficient
for this particular example.
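
The two method-of-lines schemes can be sketched compactly in Python. The code below is an illustration under stated assumptions, not the program used for Tables 6.20 and 6.21: the test solution $U = e^{-0.2t}\sin(\pi x)$, the constant `DECAY`, and all function names are choices made for this sketch, and $G$ is evaluated at $t_n$ for Euler's method and at $t_{n+1}$ for backward Euler, as in (6.9.29) and (6.9.33). The printed errors therefore need not match the tables.

```python
import math

DECAY = 0.2  # assumed decay rate of the test solution (an assumption of this sketch)

def exact(x, t):
    """Assumed known solution U(x, t) = exp(-DECAY * t) * sin(pi * x)."""
    return math.exp(-DECAY * t) * math.sin(math.pi * x)

def source(x, t):
    """G = U_t - U_xx for the assumed solution."""
    return (math.pi ** 2 - DECAY) * exact(x, t)

def max_error(v, m, t):
    """Maximum error over the node points x_j = j/m, 0 <= j <= m."""
    return max(abs(v[j] - exact(j / m, t)) for j in range(m + 1))

def euler_lines(m, h, t_end):
    """Simple explicit method (6.9.29); returns the maximum nodal error at t_end."""
    delta = 1.0 / m
    steps = round(t_end / h)
    v = [exact(j * delta, 0.0) for j in range(m + 1)]
    for n in range(steps):
        t, t_next = n * h, (n + 1) * h
        new = [exact(0.0, t_next)]                 # boundary value d_0
        for j in range(1, m):
            second_diff = (v[j + 1] - 2.0 * v[j] + v[j - 1]) / delta ** 2
            new.append(v[j] + h * (second_diff + source(j * delta, t)))
        new.append(exact(1.0, t_next))             # boundary value d_1
        v = new
    return max_error(v, m, steps * h)

def backward_euler_lines(m, h, t_end):
    """Backward Euler (6.9.33)-(6.9.34), solved by tridiagonal elimination."""
    delta = 1.0 / m
    r = h / delta ** 2
    steps = round(t_end / h)
    v = [exact(j * delta, 0.0) for j in range(m + 1)]
    for n in range(steps):
        t_next = (n + 1) * h
        left, right = exact(0.0, t_next), exact(1.0, t_next)
        rhs = [v[j] + h * source(j * delta, t_next) for j in range(1, m)]
        rhs[0] += r * left
        rhs[-1] += r * right
        diag = [1.0 + 2.0 * r] * (m - 1)           # off-diagonal entries are -r
        for i in range(1, m - 1):                  # forward elimination
            w = -r / diag[i - 1]
            diag[i] += w * r
            rhs[i] -= w * rhs[i - 1]
        sol = [0.0] * (m - 1)
        sol[-1] = rhs[-1] / diag[-1]
        for i in range(m - 3, -1, -1):             # back substitution
            sol[i] = (rhs[i] + r * sol[i + 1]) / diag[i]
        v = [left] + sol + [right]
    return max_error(v, m, steps * h)

for m in (4, 8, 16):
    delta = 1.0 / m
    print(f"m = {m:2d}: Euler (h = delta^2/2) {euler_lines(m, 0.5 * delta ** 2, 1.0):.2e}, "
          f"backward Euler (h = .1) {backward_euler_lines(m, 0.1, 1.0):.2e}")
```

Calling `euler_lines` with a time step above the bound (6.9.32), for example $h = \delta^2$, shows the growth of rounding errors in the unstable modes.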


For more discussion of the method of lines, see Aiken (1985, pp. 124-148). 
For some method-of-lines codes to solve systems of nonlinear parabolic partial 
differential equations, in one and two space variables, see Sincovec and Madsen 
(1975) and Melgaard and Sincovec (1981). 


6.10 Single-Step and Runge-Kutta Methods 


Single-step methods for solving $y' = f(x, y)$ require only a knowledge of the
numerical solution $y_n$ in order to compute the next value $y_{n+1}$. This has obvious
advantages over the p-step multistep methods that use several past values
$\{y_n, \ldots, y_{n-p+1}\}$, since then the additional initial values $\{y_1, \ldots, y_{p-1}\}$ have to
be computed by some other numerical method.

The best known one-step methods are the Runge-Kutta methods. They are
fairly simple to program, and their truncation error can be controlled in a more
straightforward manner than for the multistep methods. For the fixed-order
multistep methods that were used more commonly in the past, the Runge-Kutta
methods were the usual tool for calculating the needed initial values for the
multistep method. The major disadvantage of the Runge-Kutta methods is that
they use many more evaluations of the derivative $f(x, y)$ to attain the same
accuracy, as compared with the multistep methods. Later we will mention some
results on comparisons of variable-order Adams codes and fixed-order
Runge-Kutta codes.

The simplest one-step method is based on using the Taylor series. Assume
$Y(x)$ is $r + 1$ times continuously differentiable, where $Y(x)$ is the solution of the
initial value problem

$$ y' = f(x, y) \qquad y(x_0) = Y_0 \tag{6.10.1} $$

Expand $Y(x_1)$ about $x_0$ using Taylor's theorem:

$$ Y(x_1) = Y(x_0) + hY'(x_0) + \cdots + \frac{h^r}{r!} Y^{(r)}(x_0) + \frac{h^{r+1}}{(r+1)!} Y^{(r+1)}(\xi_0) \tag{6.10.2} $$

for some $x_0 \le \xi_0 \le x_1$. By dropping the remainder term, we have an approximation
for $Y(x_1)$, provided we can calculate $Y''(x_0), \ldots, Y^{(r)}(x_0)$. Differentiate
$Y'(x) = f(x, Y(x))$ to obtain

$$ Y''(x) = f_x(x, Y(x)) + f_y(x, Y(x))\, Y'(x), \qquad Y'' = f_x + f_y f $$

and proceed similarly to obtain the higher order derivatives of $Y(x)$.


Example Consider the problem

$$ y' = -y^2 \qquad y(0) = 1 $$

with the solution $Y(x) = 1/(1 + x)$. Then $Y'' = -2YY' = 2Y^3$, and (6.10.2)
with $r = 2$ yields

$$ Y(x_1) = Y_0 - hY_0^2 + h^2 Y_0^3 + \frac{h^3}{6} Y'''(\xi_0) \qquad x_0 \le \xi_0 \le x_1 $$

We drop the remainder to obtain an approximation of $Y(x_1)$. This can then be
used in the same manner to obtain an approximation for $Y(x_2)$, and so on. The
numerical method is

$$ y_{n+1} = y_n - h y_n^2 + h^2 y_n^3 \qquad n \ge 0 \tag{6.10.3} $$

Table 6.22 contains the errors in this numerical solution at a selected set of node
points. The grid sizes used are $h = .125$ and $h = .0625$, and the ratio of the
resulting errors is also given. Note that when $h$ is halved, the ratio is almost 4.
This can be justified theoretically since the rate of convergence can be shown to
be $O(h^2)$, with a proof similar to that given in Theorem 6.3 or Theorem 6.9,
given later in this section.
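
The method (6.10.3) is short enough to try directly. The following Python sketch applies it with the two grid sizes of the example and prints the errors $Y(x) - y_h(x)$ and their ratio; the function name `taylor2_solve` is illustrative only.

```python
def taylor2_solve(h, x_end):
    """Apply (6.10.3) to y' = -y**2, y(0) = 1, and return y_h(x_end)."""
    steps = round(x_end / h)
    y = 1.0
    for _ in range(steps):
        y = y - h * y ** 2 + h ** 2 * y ** 3
    return y

for x_end in (2.0, 4.0, 6.0, 8.0, 10.0):
    exact = 1.0 / (1.0 + x_end)
    y_fine = taylor2_solve(0.0625, x_end)
    err_fine = exact - y_fine
    err_coarse = exact - taylor2_solve(0.125, x_end)
    print(f"x = {x_end:4.1f}: y_h = {y_fine:.6f}, error (h = .0625) {err_fine: .1e}, "
          f"error (h = .125) {err_coarse: .1e}, ratio {err_coarse / err_fine:.1f}")
```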


Table 6.22 Example of the Taylor series method (6.10.3)

          h = .0625                     h = .125
  x       y_h(x)     Y(x) - y_h(x)     Y(x) - y_h(x)    Ratio
  2.0     .333649    -3.2E-4           -1.4E-3          4.4
  4.0     .200135    -1.4E-4           -5.9E-4          4.3
  6.0     .142931    -7.4E-5           -3.2E-4          4.3
  8.0     .111157    -4.6E-5           -2.0E-4          4.3
  10.0    .090941    -3.1E-5           -1.4E-4          4.3

The Taylor series method can give excellent results. But it is bothersome to use
because of the need to analytically differentiate $f(x, y)$. The derivatives can be
very difficult to calculate and very time-consuming to evaluate, especially for
systems of equations. These differentiations can be carried out using a symbolic
manipulation language on a computer, and then it is easy to produce a Taylor
series method. However, the derivatives are still likely to be quite time-consuming
to evaluate, and it appears that methods based on evaluating just $f(x, y)$ will
remain more efficient. To imitate the Taylor series method, while evaluating only
$f(x, y)$, we turn to the Runge-Kutta formulas.


Runge-Kutta methods The Runge-Kutta methods are closely related to the
Taylor series expansion of $Y(x)$ in (6.10.2), but no differentiations of $f$ are
necessary in the use of the methods. For notational convenience, we abbreviate
Runge-Kutta to RK. All RK methods will be written in the form

$$ y_{n+1} = y_n + hF(x_n, y_n, h; f) \qquad n \ge 0 \tag{6.10.4} $$

We begin with examples of the function $F$, and will later discuss hypotheses for
it. But at this point, it should be intuitive that we want

$$ F(x, Y(x), h; f) \approx Y'(x) = f(x, Y(x)) \tag{6.10.5} $$

for all small values of $h$. Define the truncation error for (6.10.4) by

$$ T_n(Y) = Y(x_{n+1}) - Y(x_n) - hF(x_n, Y(x_n), h; f) \qquad n \ge 0 \tag{6.10.6} $$

and define $\tau_n(Y)$ implicitly by

$$ T_n(Y) = h\tau_n(Y) $$

Rearranging (6.10.6), we obtain

$$ Y(x_{n+1}) = Y(x_n) + hF(x_n, Y(x_n), h; f) + h\tau_n(Y) \qquad n \ge 0 \tag{6.10.7} $$

In Theorem 6.9, this will be compared with (6.10.4) to prove convergence of $\{y_n\}$
to $Y$.

Example 1. Consider the trapezoidal method, solved with one iteration using
Euler’s method as the predictor:

$$y_{n+1} = y_n + \frac{h}{2}\left[f(x_n, y_n) + f(x_{n+1}, y_n + hf(x_n, y_n))\right], \qquad n \ge 0 \tag{6.10.8}$$

In the notation of (6.10.4),

$$F(x, y, h; f) = \tfrac{1}{2}\left[f(x, y) + f(x + h, y + hf(x, y))\right]$$

As can be seen in Figure 6.9, F is an average slope of Y(x) on [x, x + h].

2. The following method is also based on obtaining an average slope for the
solution on $[x_n, x_{n+1}]$:

$$y_{n+1} = y_n + hf\left(x_n + \tfrac{1}{2}h,\; y_n + \tfrac{1}{2}hf(x_n, y_n)\right), \qquad n \ge 0 \tag{6.10.9}$$




[Figure 6.9  Illustration of Runge-Kutta method (6.10.8). The graph of z = Y(x)
on [x, x + h], with lines of slope f(x, Y(x)) and f(x + h, Y(x) + hf(x, Y(x))),
and the resulting value Y(x) + hF(x, Y(x), h; f).]


For this case,

$$F(x, y, h; f) = f\left(x + \tfrac{1}{2}h,\; y + \tfrac{1}{2}hf(x, y)\right)$$

The derivation of a formula for the truncation error is linked to the derivation
of these methods, and this will also be true when considering RK methods of a
higher order. The derivation of RK methods will be illustrated by deriving a
family of second-order formulas, which will include (6.10.8) and (6.10.9). We
suppose F has the general form

$$F(x, y, h; f) = \gamma_1 f(x, y) + \gamma_2 f(x + \alpha h, y + \beta hf(x, y)) \tag{6.10.10}$$

in which the four constants $\gamma_1, \gamma_2, \alpha, \beta$ are to be determined.

We will use Taylor’s theorem (Theorem 1.5) for functions of two variables to
expand the second term on the right side of (6.10.10), through the second
derivative terms. This gives

$$F(x, y, h; f) = \gamma_1 f(x, y) + \gamma_2\left\{f(x, y) + h\left[\alpha f_x + \beta ff_y\right] + h^2\left[\tfrac{1}{2}\alpha^2 f_{xx} + \alpha\beta f_{xy}f + \tfrac{1}{2}\beta^2 f^2 f_{yy}\right]\right\} + O(h^3) \tag{6.10.11}$$

Also we will need some derivatives of Y'(x) = f(x, Y(x)), namely

$$\begin{aligned}
Y'' &= f_x + f_y f \\
Y^{(3)} &= f_{xx} + 2f_{xy}f + f_{yy}f^2 + f_y f_x + f_y^2 f
\end{aligned} \tag{6.10.12}$$
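For the truncation error,

$$\begin{aligned}
T_n(Y) &= Y(x_{n+1}) - Y(x_n) - hF(x_n, Y(x_n), h; f) \\
&= hY_n' + \frac{h^2}{2}Y_n'' + \frac{h^3}{6}Y_n^{(3)} + O(h^4) - hF(x_n, Y_n, h; f)
\end{aligned}$$

Substituting from (6.10.11) and (6.10.12), and collecting together common powers
of h, we obtain

$$\begin{aligned}
T_n(Y) = {} & h\left[1 - \gamma_1 - \gamma_2\right]f + h^2\left[\left(\tfrac{1}{2} - \gamma_2\alpha\right)f_x + \left(\tfrac{1}{2} - \gamma_2\beta\right)f_y f\right] \\
& + h^3\left[\left(\tfrac{1}{6} - \tfrac{1}{2}\gamma_2\alpha^2\right)f_{xx} + \left(\tfrac{1}{3} - \gamma_2\alpha\beta\right)f_{xy}f + \left(\tfrac{1}{6} - \tfrac{1}{2}\gamma_2\beta^2\right)f_{yy}f^2 \right. \\
& \left. \qquad + \tfrac{1}{6}f_y f_x + \tfrac{1}{6}f_y^2 f\right] + O(h^4)
\end{aligned} \tag{6.10.13}$$

All derivatives are evaluated at $(x_n, Y_n)$.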

We wish to make the truncation error converge to zero as rapidly as possible.
The coefficient of $h^3$ cannot be zero in general, if f is allowed to vary arbitrarily.
The requirement that the coefficients of h and $h^2$ be zero leads to

$$\gamma_1 + \gamma_2 = 1, \qquad \gamma_2\alpha = \tfrac{1}{2}, \qquad \gamma_2\beta = \tfrac{1}{2} \tag{6.10.14}$$

and this yields

$$T_n(Y) = O(h^3)$$


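The system (6.10.14) is underdetermined, and its general solution is

$$\gamma_1 = 1 - \gamma_2, \qquad \alpha = \beta = \frac{1}{2\gamma_2} \tag{6.10.15}$$

with $\gamma_2 \ne 0$ arbitrary. Both (6.10.8) (with $\gamma_2 = \tfrac{1}{2}$) and (6.10.9) (with
$\gamma_2 = 1$) are special cases of this solution.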
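The one-parameter family defined by (6.10.10) and (6.10.14) is easy to try out
numerically. The sketch below is illustrative only: the names `rk2_step` and
`solve` are hypothetical, and the test problem is (6.10.22), which is used later
in this section. Each choice of the free parameter should show an error ratio
near 4 when h is halved, although the size of the error differs between members
of the family.

```python
import math


def rk2_step(f, x, y, h, gamma2):
    """One step of (6.10.4) with F from (6.10.10); requires gamma2 != 0."""
    gamma1 = 1.0 - gamma2
    alpha = beta = 1.0 / (2.0 * gamma2)   # solution of (6.10.14)
    k1 = f(x, y)
    k2 = f(x + alpha * h, y + beta * h * k1)
    return y + h * (gamma1 * k1 + gamma2 * k2)


def solve(f, x0, y0, x_end, h, gamma2):
    """Fixed-stepsize integration from x0 to x_end."""
    n_steps = round((x_end - x0) / h)
    x, y = x0, y0
    for i in range(n_steps):
        y = rk2_step(f, x, y, h, gamma2)
        x = x0 + (i + 1) * h
    return y


def f(x, y):
    return 1.0 / (1.0 + x * x) - 2.0 * y * y   # problem (6.10.22)


def exact(x):
    return x / (1.0 + x * x)


# gamma2 = 1/2 is (6.10.8), gamma2 = 1 is (6.10.9), gamma2 = 3/4 is the
# choice discussed in the text that follows.
for gamma2 in (0.5, 0.75, 1.0):
    e_h = exact(2.0) - solve(f, 0.0, 0.0, 2.0, 0.05, gamma2)
    e_2h = exact(2.0) - solve(f, 0.0, 0.0, 2.0, 0.1, gamma2)
    print(gamma2, e_h, e_2h / e_h)
```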

By substituting into (6.10.13), we can obtain the leading term in the truncation
error, dependent on only $\gamma_2$. In some cases, the value of $\gamma_2$ has been chosen to
make the coefficient of $h^3$ as small as possible, while allowing f to vary
arbitrarily. For example, if we write (6.10.13) as

$$T_n(Y) = c(f, \gamma_2)h^3 + O(h^4) \tag{6.10.16}$$

then the Cauchy-Schwarz inequality [see (7.1.8) in Chapter 7] can be used to
show

$$|c(f, \gamma_2)| \le c_1(f)\,c_2(\gamma_2) \tag{6.10.17}$$




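where

$$c_1(f) = \left[f_{xx}^2 + f_{xy}^2 f^2 + f_{yy}^2 f^4 + f_y^2 f_x^2 + f_y^4 f^2\right]^{1/2}$$

$$c_2(\gamma_2) = \left[\left(\tfrac{1}{6} - \tfrac{1}{2}\gamma_2\alpha^2\right)^2 + \left(\tfrac{1}{3} - \gamma_2\alpha\beta\right)^2 + \left(\tfrac{1}{6} - \tfrac{1}{2}\gamma_2\beta^2\right)^2 + \tfrac{1}{18}\right]^{1/2}$$

with $\alpha, \beta$ given by (6.10.15). The minimum value of $c_2(\gamma_2)$ is attained with
$\gamma_2 = .75$, and $c_2(.75) = 1/\sqrt{18}$. The resulting second-order numerical method is

$$y_{n+1} = y_n + \frac{h}{4}\left[f(x_n, y_n) + 3f\left(x_n + \tfrac{2}{3}h,\; y_n + \tfrac{2}{3}hf(x_n, y_n)\right)\right], \qquad n \ge 0 \tag{6.10.18}$$

It is optimal in the sense of minimizing the coefficient $c_2(\gamma_2)$ of the term
$c_1(f)h^3$ in the truncation error. For an extensive discussion of this means of
analyzing the truncation error in RK methods, see Shampine (1986).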

Higher order formulas can be created and analyzed in an analogous manner,
although the algebra becomes very complicated. Assume a formula for
F(x, y, h; f) of the form

$$F(x, y, h; f) = \sum_{j=1}^{p} \gamma_j V_j \tag{6.10.19}$$

$$\begin{aligned}
V_1 &= f(x, y) \\
V_j &= f\left(x + \alpha_j h,\; y + h\sum_{i=1}^{j-1}\beta_{ji}V_i\right), \qquad j = 2, \ldots, p
\end{aligned} \tag{6.10.20}$$


These coefficients can be chosen to make the leading terms in the truncation error
equal to zero, just as was done with (6.10.10) and (6.10.14). There is obviously a
connection between the number of evaluations of f(x, y), call it p, and the
maximum possible order that can be attained for the truncation error. These are
given in Table 6.23, which is due in part to Butcher (1965).

Table 6.23  Maximum order of the Runge-Kutta methods

  Number of function evaluations    1   2   3   4   5   6   7
  Maximum order of method           1   2   3   4   4   5   6

Until about 1970, the most popular RK method was probably the original
classical formula, which is a generalization of Simpson’s rule. The method is

$$y_{n+1} = y_n + \frac{h}{6}\left[V_1 + 2V_2 + 2V_3 + V_4\right] \tag{6.10.21}$$

$$\begin{aligned}
V_1 &= f(x_n, y_n) & V_2 &= f\left(x_n + \tfrac{1}{2}h,\; y_n + \tfrac{1}{2}hV_1\right) \\
V_3 &= f\left(x_n + \tfrac{1}{2}h,\; y_n + \tfrac{1}{2}hV_2\right) & V_4 &= f(x_n + h,\; y_n + hV_3)
\end{aligned}$$

It can be proved that this is a fourth-order formula with $T_n(Y) = O(h^5)$. If
f(x, y) does not depend on y, then this formula reduces to Simpson’s integration
rule.
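As a quick check of the last remark, the following sketch codes one step of the
classical formula and compares it with Simpson's rule for an integrand that
depends only on x. The function names `rk4_step` and `simpson` are chosen here
for illustration.

```python
def rk4_step(f, x, y, h):
    """One step of the classical fourth-order RK method (6.10.21)."""
    v1 = f(x, y)
    v2 = f(x + 0.5 * h, y + 0.5 * h * v1)
    v3 = f(x + 0.5 * h, y + 0.5 * h * v2)
    v4 = f(x + h, y + h * v3)
    return y + (h / 6.0) * (v1 + 2.0 * v2 + 2.0 * v3 + v4)


def simpson(g, a, b):
    """Simpson's rule on [a, b] using the midpoint (a + b)/2."""
    return (b - a) / 6.0 * (g(a) + 4.0 * g(0.5 * (a + b)) + g(b))


def g(x):
    return x**4


# When f(x, y) = g(x), the stages V2 and V3 coincide and the step is
# exactly Simpson's rule applied to the integral of g over [x, x + h].
y_rk = rk4_step(lambda x, y: g(x), 0.0, 0.0, 1.0)
y_simpson = simpson(g, 0.0, 1.0)
print(y_rk, y_simpson, 0.2 - y_rk)
```

For this integrand the exact integral over [0, 1] is 1/5, so the printed
difference is the truncation error of a single step with h = 1.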


Example  Consider the problem

$$y' = \frac{1}{1 + x^2} - 2y^2, \qquad y(0) = 0 \tag{6.10.22}$$

with the solution $Y = x/(1 + x^2)$. The method (6.10.21) was used with a fixed
stepsize, and the results are shown in Table 6.24. The stepsizes are h = .25 and
2h = .5. The column Ratio gives the ratio of the errors for corresponding node
points as h is halved. The last column is an example of formula (6.10.24) from
the following material on Richardson extrapolation. Because $T_n(Y) = O(h^5)$ for
method (6.10.21), Theorem 6.9 implies that the rate of convergence of $y_h(x)$ to
Y(x) is $O(h^4)$. The theoretical value of Ratio is 16, and as h decreases further,
this value will be realized more closely.


The RK methods have asymptotic error formulas, the same as the multistep
methods. For (6.10.21),

$$Y(x) - y_h(x) = D(x)h^4 + O(h^5) \tag{6.10.23}$$

where D(x) satisfies a certain initial value problem. The proof of this result is an
extension of the proof of Theorem 6.9, and it is similar to the derivation of
Theorem 6.4 for Euler’s method, in Section 6.2. The result (6.10.23) can be used
to produce an error estimate, just as was done with the trapezoidal method in
formula (6.5.26) of Section 6.5. For a stepsize of 2h,

$$Y(x) - y_{2h}(x) = 16D(x)h^4 + O(h^5)$$

Proceeding as earlier for (6.5.27), we obtain

$$Y(x) - y_h(x) = \frac{1}{15}\left[y_h(x) - y_{2h}(x)\right] + O(h^5) \tag{6.10.24}$$

and the first term on the right side is an estimate of the left-hand error. This is
illustrated in the last column of Table 6.24.

Table 6.24  Example of Runge-Kutta method (6.10.21)

    x      y_h(x)      Y(x) - y_h(x)   Y(x) - y_2h(x)   Ratio   (1/15)[y_h(x) - y_2h(x)]
   2.0   .39995699       4.3E-5           1.0E-3          24        6.7E-5
   4.0   .23529159       2.5E-6           7.0E-5          28        4.5E-6
   6.0   .16216179       3.7E-7           1.2E-5          32        (illegible)
   8.0   .12307683       9.2E-8           3.4E-6          36        2.2E-7
  10.0   .09900987       3.1E-8           1.3E-6          41        8.2E-8
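The Richardson estimate (6.10.24) needs only the two numerical solutions. The
sketch below recomputes the columns of Table 6.24 for problem (6.10.22), under
the assumption that the classical method (6.10.21) is run with h = .25 and
2h = .5 from x = 0. The helper names `rk4_solve` and `richardson_table` are
hypothetical.

```python
def rk4_solve(f, x0, y0, x_end, h):
    """Classical fourth-order RK method (6.10.21) with fixed stepsize h."""
    n_steps = round((x_end - x0) / h)
    x, y = x0, y0
    for i in range(n_steps):
        v1 = f(x, y)
        v2 = f(x + 0.5 * h, y + 0.5 * h * v1)
        v3 = f(x + 0.5 * h, y + 0.5 * h * v2)
        v4 = f(x + h, y + h * v3)
        y += (h / 6.0) * (v1 + 2.0 * v2 + 2.0 * v3 + v4)
        x = x0 + (i + 1) * h
    return y


def f(x, y):
    return 1.0 / (1.0 + x * x) - 2.0 * y * y   # problem (6.10.22)


def exact(x):
    return x / (1.0 + x * x)


def richardson_table(h=0.25, nodes=(2.0, 4.0, 6.0, 8.0, 10.0)):
    """Rows of (x, y_h, Y - y_h, Y - y_2h, ratio, Richardson estimate)."""
    rows = []
    for x in nodes:
        y_h = rk4_solve(f, 0.0, 0.0, x, h)
        y_2h = rk4_solve(f, 0.0, 0.0, x, 2.0 * h)
        err_h = exact(x) - y_h
        err_2h = exact(x) - y_2h
        estimate = (y_h - y_2h) / 15.0        # first term of (6.10.24)
        rows.append((x, y_h, err_h, err_2h, err_2h / err_h, estimate))
    return rows


for row in richardson_table():
    print("%5.1f %.8f %9.1e %9.1e %5.1f %9.1e" % row)
```

The estimate uses no knowledge of Y(x); the true errors are printed only for
comparison.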


Convergence analysis   In order to obtain convergence of the general schema
(6.10.4), we need to have $\tau_n(Y) \to 0$ as $h \to 0$. Since

$$\tau_n(Y) = \frac{Y(x_{n+1}) - Y(x_n)}{h} - F(x_n, Y(x_n), h; f)$$

we require that

$$F(x, Y(x), h; f) \to Y'(x) = f(x, Y(x)) \qquad \text{as } h \to 0$$

More precisely, define

$$\delta(h) = \max_{\substack{x_0 \le x \le b \\ -\infty < y < \infty}} |f(x, y) - F(x, y, h; f)| \tag{6.10.25}$$

and assume

$$\delta(h) \to 0 \qquad \text{as } h \to 0 \tag{6.10.26}$$

This is occasionally called the consistency condition for the RK method (6.10.4).
We will also need a Lipschitz condition on F:

$$|F(x, y, h; f) - F(x, z, h; f)| \le L|y - z| \tag{6.10.27}$$

for all $x_0 \le x \le b$, $-\infty < y, z < \infty$, and all small h > 0. This condition is
usually proved by using the Lipschitz condition (6.2.12) on f(x, y). For example,
with method (6.10.9),

$$\begin{aligned}
|F(x, y, h; f) - F(x, z, h; f)| &= \left|f\left(x + \frac{h}{2},\; y + \frac{h}{2}f(x, y)\right) - f\left(x + \frac{h}{2},\; z + \frac{h}{2}f(x, z)\right)\right| \\
&\le K\left|y - z + \frac{h}{2}\left[f(x, y) - f(x, z)\right]\right| \\
&\le K\left(1 + \frac{h}{2}K\right)|y - z|
\end{aligned}$$

Choose $L = K(1 + \tfrac{1}{2}K)$ for $h \le 1$.
Theorem 6.9  Assume that the Runge-Kutta method (6.10.4) satisfies the
Lipschitz condition (6.10.27). Then for the initial value problem
(6.10.1), the solution $\{y_n\}$ satisfies

$$\max_{x_0 \le x_n \le b} |Y(x_n) - y_n| \le e^{(b - x_0)L}|Y_0 - y_0| + \left[\frac{e^{(b - x_0)L} - 1}{L}\right]\tau(h) \tag{6.10.28}$$

where

$$\tau(h) = \max_{x_0 \le x_n \le b} |\tau_n(Y)| \tag{6.10.29}$$

If the consistency condition (6.10.26) is also satisfied, then the
numerical solution $\{y_n\}$ converges to Y(x).

Proof  Subtract (6.10.4) from (6.10.7) to obtain

$$e_{n+1} = e_n + h\left[F(x_n, Y_n, h; f) - F(x_n, y_n, h; f)\right] + h\tau_n(Y) \tag{6.10.30}$$

in which $e_n = Y(x_n) - y_n$. Apply the Lipschitz condition (6.10.27) and
use (6.10.29) to obtain

$$|e_{n+1}| \le (1 + hL)|e_n| + h\tau(h), \qquad x_0 \le x_n \le b$$

As with the convergence proof for the Euler method, this leads easily to
the result (6.10.28) [see (6.2.20)-(6.2.22)].

In most cases, it is known by direct computation that $\tau(h) \to 0$ as
$h \to 0$, and in that case, convergence of $\{y_n\}$ to Y(x) is immediately
proved. But all that we need to know is that (6.10.26) is satisfied. To see
this, write

$$\begin{aligned}
h\tau_n(Y) &= Y(x_{n+1}) - Y(x_n) - hF(x_n, Y(x_n), h; f) \\
&= hY'(x_n) + \frac{h^2}{2}Y''(\xi_n) - hF(x_n, Y(x_n), h; f)
\end{aligned}$$

for some $\xi_n$ between $x_n$ and $x_{n+1}$. Since $Y'(x_n) = f(x_n, Y(x_n))$,

$$h|\tau_n(Y)| \le h\delta(h) + \frac{h^2}{2}\|Y''\|_\infty$$

$$\tau(h) \le \delta(h) + \frac{1}{2}h\|Y''\|_\infty$$

Thus $\tau(h) \to 0$ as $h \to 0$, completing the proof. ∎
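The step from the one-step error recursion in the proof to the bound (6.10.28)
is left to the reader in the text. One way to fill it in, following the Euler
argument, is to apply the recursion repeatedly and sum the resulting geometric
series, with $e_0 = Y_0 - y_0$:

```latex
\begin{aligned}
|e_n| &\le (1 + hL)^n |e_0|
        + \left[1 + (1 + hL) + \cdots + (1 + hL)^{n-1}\right] h\,\tau(h) \\
      &= (1 + hL)^n |e_0| + \frac{(1 + hL)^n - 1}{L}\,\tau(h), \\
(1 + hL)^n &\le e^{nhL} = e^{(x_n - x_0)L} \le e^{(b - x_0)L},
        \qquad x_0 \le x_n \le b .
\end{aligned}
```

Substituting the second line's bound on $(1 + hL)^n$ into the first gives
(6.10.28). The inequality $1 + hL \le e^{hL}$ is the only fact used beyond the
recursion itself.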


The following result is an immediate consequence of (6.10.28). 
Corollary  If the Runge-Kutta method (6.10.4) has a truncation error
$T_n(Y) = O(h^{m+1})$, then the rate of convergence of $\{y_n\}$ to Y(x) is $O(h^m)$.

It is not too difficult to derive an asymptotic error formula for the Runge-Kutta
method (6.10.4), provided one is known for the truncation error. Assume

$$T_n(Y) = \varphi(x_n)h^{m+1} + O(h^{m+2}) \tag{6.10.31}$$

with $\varphi(x)$ determined by Y(x) and f(x, Y(x)). As an example, see the result
(6.10.13) to obtain this expansion for second-order RK methods. Strengthened
forms of (6.10.26) and (6.10.27) are also necessary. Assume

$$F(x, y, h; f) - F(x, z, h; f) = \frac{\partial F(x, y, h; f)}{\partial y}(y - z) + O\left((y - z)^2\right) \tag{6.10.32}$$

and also that

$$\delta_1(h) = \max_{\substack{x_0 \le x \le b \\ -\infty < y < \infty}} \left|\frac{\partial f(x, y)}{\partial y} - \frac{\partial F(x, y, h; f)}{\partial y}\right| \to 0 \qquad \text{as } h \to 0 \tag{6.10.33}$$

In practice, both of these results are straightforward to confirm. With these
assumptions, we can derive the formula

$$Y(x) - y_h(x) = D(x)h^m + O(h^{m+1}) \tag{6.10.34}$$

with D(x) satisfying the linear initial value problem

$$D'(x) = f_y(x, Y(x))D(x) + \varphi(x), \qquad D(x_0) = 0 \tag{6.10.35}$$


Stability results can be given for RK methods, in analogy with those for 
multistep methods. The basic type of stability, defined near the beginning of 
Section 6.8, is easily proved. The proof is a simple modification of that of
(6.2.29), the stability of Euler’s method, and we leave it to Problem 49. An 
essential difference with the multistep theory is that there are no parasitic 
solutions created by the RK methods; thus the concept of relative stability does 
not apply to RK methods. The regions of absolute stability can be studied as 
with the multistep theory, but we omit it here, leaving it to the problems. 


Estimation of the truncation error   In order to control the size of the truncation
error, we must first be able to estimate it. Let $u_n(x)$ denote the solution of
y' = f(x, y) passing through $(x_n, y_n)$ [see (6.5.9)]. We wish to estimate the error
in $y_h(x_n + 2h)$ relative to $u_n(x_n + 2h)$.

Using (6.10.30) for the error in $y_h(x)$ compared to $u_n(x)$, and using the
asymptotic formula (6.10.31),

$$\begin{aligned}
u_n(x_{j+1}) - y_h(x_{j+1}) = {} & u_n(x_j) - y_h(x_j) \\
& + h\left\{F(x_j, u_n(x_j), h; f) - F(x_j, y_h(x_j), h; f)\right\} \\
& + \varphi_n(x_j)h^{m+1} + O(h^{m+2}), \qquad j \ge n
\end{aligned}$$

From this, it is straightforward to prove

$$\begin{aligned}
u_n(x_n + h) - y_h(x_n + h) &= \varphi_n(x_n)h^{m+1} + O(h^{m+2}) \\
u_n(x_n + 2h) - y_h(x_n + 2h) &= 2\varphi_n(x_n)h^{m+1} + O(h^{m+2})
\end{aligned}$$

Applying the same procedure to $y_{2h}(x)$,

$$u_n(x_n + 2h) - y_{2h}(x_n + 2h) = 2^{m+1}\varphi_n(x_n)h^{m+1} + O(h^{m+2})$$

From the last two equations,

$$u_n(x_n + 2h) - y_h(x_n + 2h) = \frac{1}{2^m - 1}\left[y_h(x_n + 2h) - y_{2h}(x_n + 2h)\right] + O(h^{m+2}) \tag{6.10.36}$$

and the first term on the right is an asymptotic estimate of the error on the left
side.

Consider the computation of $y_h(x_n + 2h)$ from $y_h(x_n) = y_n$ as a single step in
an algorithm. Suppose that a user has given an error tolerance $\epsilon$ and that the
value of

$$\text{trunc} \equiv u_n(x_n + 2h) - y_h(x_n + 2h) \approx \frac{1}{2^m - 1}\left[y_h(x_n + 2h) - y_{2h}(x_n + 2h)\right] \tag{6.10.37}$$

is to satisfy

$$.5\epsilon h \le |\text{trunc}| \le 2\epsilon h \tag{6.10.38}$$

This controls the error per unit step and it requires that the error be neither too
large nor too small. Recall the concept of error per unit stepsize in (6.6.1) of
Section 6.6.

If the test is satisfied, then computation continues with the same h. But if it is
not satisfied, then a new value $\hat{h}$ must be chosen. Let it be chosen by

$$2|\varphi_n(x_n)|\hat{h}^{m+1} = \epsilon\hat{h}$$

where $\varphi_n(x_n)$ is determined from

$$\varphi_n(x_n) \approx \frac{\text{trunc}}{2h^{m+1}}$$

With the new value of h, the new truncation error should lie near the midpoint of
(6.10.38). This form of algorithm has been implemented with a number of
methods. For example, see Gear (1971, pp. 83-84) for a similar algorithm for a
fourth-order RK method.
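The following is a minimal sketch of such a step-doubling algorithm for the
classical fourth-order method, not a reproduction of any published code. Several
details are assumptions made here because the text leaves them open: a double
step is accepted whenever the estimate does not exceed the upper bound in
(6.10.38), a step that is too accurate is accepted before h is enlarged, growth
of h is capped at a factor of 2, and the last step is shortened to end exactly
at `x_end`. The names `rk4_step` and `adaptive_rk4` are illustrative.

```python
import math


def rk4_step(f, x, y, h):
    """One step of the classical fourth-order RK method (6.10.21)."""
    v1 = f(x, y)
    v2 = f(x + 0.5 * h, y + 0.5 * h * v1)
    v3 = f(x + 0.5 * h, y + 0.5 * h * v2)
    v4 = f(x + h, y + h * v3)
    return y + (h / 6.0) * (v1 + 2.0 * v2 + 2.0 * v3 + v4)


def adaptive_rk4(f, x0, y0, x_end, eps, h=0.1, m=4):
    """Advance in double steps 2h, testing (6.10.38) with estimate (6.10.37)."""
    x, y = x0, y0
    accepted = 0
    while x_end - x > 1e-12 * max(1.0, abs(x_end)):
        h = min(h, 0.5 * (x_end - x))            # do not step past x_end
        y_h = rk4_step(f, x + h, rk4_step(f, x, y, h), h)   # two steps of h
        y_2h = rk4_step(f, x, y, 2.0 * h)                    # one step of 2h
        trunc = (y_h - y_2h) / (2**m - 1)
        too_large = abs(trunc) > 2.0 * eps * h
        too_small = abs(trunc) < 0.5 * eps * h
        if not too_large:
            x, y = x + 2.0 * h, y_h              # accept the double step
            accepted += 1
        if too_large or too_small:
            if trunc == 0.0:
                h *= 2.0
            else:
                # Solve 2|phi| h_new**(m+1) = eps * h_new with
                # phi = trunc / (2 h**(m+1)); cap the growth at 2 (assumption).
                h *= min(2.0, (eps * h / abs(trunc)) ** (1.0 / m))
    return y, accepted


y_end, n_double_steps = adaptive_rk4(lambda x, y: -y, 0.0, 1.0, 1.0, eps=1e-6)
print(y_end - math.exp(-1.0), n_double_steps)
```

Each pass through the loop uses eleven or twelve evaluations of f, depending on
whether f(x, y) at the start of the step is reused; this sketch recomputes it
for simplicity.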

With many programs, the error estimate (6.10.37) is added to the current
calculated value of $y_h(x_n + 2h)$, giving a more accurate result. This is called local
extrapolation. When it is used, the error per unit step criterion of (6.10.38) is
replaced by an error per step criterion:

$$.5\epsilon \le |\text{trunc}| \le 2\epsilon \tag{6.10.39}$$

In such cases, it can be shown that the local error in the extrapolated value of
$y_h(x_n + 2h)$ satisfies a modified error per unit step criterion [see Shampine and
Gordon (1975), p. 100]. For implementations of the same method, programs that
use local extrapolation and the error per step criterion appear to be more efficient
than those using (6.10.38) and not using local extrapolation.

To better understand the expense of RK methods with the error estimation 
previously given, consider only fourth-order RK methods with four evaluations 
of f(x, y) per RK step. In going from $x_n$ to $x_n + 2h$, eight evaluations will be
required to obtain $y_h(x_n + 2h)$, and three additional evaluations to obtain
$y_{2h}(x_n + 2h)$. Thus a single step of the variable-step algorithm will require
eleven evaluations of f. Although fairly expensive to use when compared with a 
multistep method, a variable-stepsize RK method is very stable, reliable, and is 
comparatively easy to program for a computer. 


Runge-Kutta-Fehlberg methods   The Runge-Kutta-Fehlberg methods are RK
methods in which the truncation error is computed by comparing the computed
answer $y_{n+1}$ with the result of an associated higher order RK formula. The most
popular of such methods are due to E. Fehlberg [e.g., see Fehlberg (1970)]; these
are currently the most popular RK methods. To clarify the presentation, we
consider only the most popular pair of Runge-Kutta-Fehlberg (RKF) formulas,
of order 4 and 5. These formulas are computed simultaneously, and their
difference is taken as an estimate of the truncation error in the fourth-order
method.

Note from Table 6.23 that a fifth-order RK method requires six evaluations of 
f per step. Consequently, Fehlberg chose to use five evaluations of f for the 
fourth-order formula, rather than the usual four. This extra degree of freedom in 
choosing the fourth-order formula allowed it to be chosen with a smaller 
truncation error, and this is illustrated later. 

As before, define

$$\begin{aligned}
V_1 &= f(x_n, y_n) \\
V_j &= f\left(x_n + \alpha_j h,\; y_n + h\sum_{i=1}^{j-1}\beta_{ji}V_i\right), \qquad j = 2, \ldots, 6
\end{aligned} \tag{6.10.40}$$

The fourth- and fifth-order formulas are, respectively,

$$y_{n+1} = y_n + h\sum_{j=1}^{5}\gamma_j V_j \tag{6.10.41}$$

$$\hat{y}_{n+1} = y_n + h\sum_{j=1}^{6}\hat{\gamma}_j V_j \tag{6.10.42}$$

The truncation error in $y_{n+1}$ is approximately

$$\text{trunc} \approx \hat{y}_{n+1} - y_{n+1} = h\sum_{j=1}^{6}c_j V_j \tag{6.10.43}$$

The coefficients are given in Tables 6.25 and 6.26.

Table 6.25  Coefficients α_j and β_ji for the RKF method

                                   β_ji
  j    α_j      i = 1          2            3            4           5
  2    1/4       1/4
  3    3/8       3/32         9/32
  4   12/13   1932/2197   -7200/2197    7296/2197
  5     1      439/216        -8        3680/513     -845/4104
  6    1/2      -8/27          2       -3544/2565    1859/4104    -11/40

Table 6.26  Coefficients γ_j, γ̂_j, c_j for the RKF method

  j        1        2         3              4            5        6
  γ_j    25/216     0     1408/2565      2197/4104      -1/5
  γ̂_j    16/135     0     6656/12825    28561/56430     -9/50     2/55
  c_j     1/360     0     -128/4275     -2197/75240      1/50     2/55
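A single RKF step can be coded directly from the two coefficient tables. The
sketch below stores the coefficients as exact fractions so that the formulas can
be checked without rounding error; with floating-point arguments it behaves as
an ordinary numerical step. The names `ALPHA`, `BETA`, `GAMMA`, `GAMMA_HAT`,
and `rkf45_step` are chosen for this illustration, and the test problem
y' = x^4, y(0) = 0 is the one used in the comparison that follows in the text.

```python
from fractions import Fraction as Fr

# Coefficients of Tables 6.25 and 6.26 (index 0 corresponds to j = 1).
ALPHA = [0, Fr(1, 4), Fr(3, 8), Fr(12, 13), 1, Fr(1, 2)]
BETA = [
    [],
    [Fr(1, 4)],
    [Fr(3, 32), Fr(9, 32)],
    [Fr(1932, 2197), Fr(-7200, 2197), Fr(7296, 2197)],
    [Fr(439, 216), -8, Fr(3680, 513), Fr(-845, 4104)],
    [Fr(-8, 27), 2, Fr(-3544, 2565), Fr(1859, 4104), Fr(-11, 40)],
]
GAMMA = [Fr(25, 216), 0, Fr(1408, 2565), Fr(2197, 4104), Fr(-1, 5), 0]
GAMMA_HAT = [Fr(16, 135), 0, Fr(6656, 12825), Fr(28561, 56430),
             Fr(-9, 50), Fr(2, 55)]


def rkf45_step(f, x, y, h):
    """Return (y4, y5, trunc) from (6.10.41), (6.10.42), and (6.10.43)."""
    v = []
    for j in range(6):
        # Stage value V_{j+1} from (6.10.40); the sum is empty for j = 0.
        y_stage = y + h * sum(b * vi for b, vi in zip(BETA[j], v))
        v.append(f(x + ALPHA[j] * h, y_stage))
    y4 = y + h * sum(g * vj for g, vj in zip(GAMMA, v))
    y5 = y + h * sum(g * vj for g, vj in zip(GAMMA_HAT, v))
    return y4, y5, y5 - y4


# One step of size h = 1 for y' = x^4, y(0) = 0, whose solution is x^5/5.
y4, y5, trunc = rkf45_step(lambda x, y: x**4, Fr(0), Fr(0), Fr(1))
print(y4, y5, trunc, float(Fr(1, 5) - y4))
```

Here the fifth-order value is exact, so the estimate `trunc` coincides with the
true truncation error of the fourth-order formula for this particular problem.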
To compare the classical RK method (6.10.21) and the preceding RKF
method, consider the truncation errors in solving the simple problem

$$y' = x^4, \qquad y(0) = 0$$

The truncation errors for (6.10.41) and (6.10.21) are, respectively,

    RKF (6.10.41):   $T_n(Y) = .00048h^5$
    RK (6.10.21):    $T_n(Y) = -.0083h^5$,    $n \ge 0$

This suggests that the RKF method should generally have a smaller truncation
error, although in practice, the difference is generally not this great. Note that the
classical method (6.10.21) with a stepsize h and using the error estimate (6.10.37)
will require eleven evaluations of f to go from $y_h(x_n)$ to $y_h(x_n + 2h)$. And the
RKF method (6.10.41) will require twelve evaluations to go from $y_h(x_n)$ to
$y_h(x_n + 2h)$. Consequently, the computational effort in going from $x_n$ to $x_n + 2h$
is comparable, and it is fair to compare their errors by using the same value of h.

Example  Use the RKF method (6.10.41) to solve the problem (6.10.22). It was
previously an example for the classical RK method (6.10.21). As before, h = .25;
and the results are given in Table 6.27. The theoretical value for Ratio is again
16, and clearly it has not yet settled down to that value. As h decreases, it
approaches 16 more closely. The use of the Richardson extrapolation formula
(6.10.24) is given in the last column, and it clearly overestimates the error.
Nonetheless, this is still a useful error estimate in that it gives some idea of the
magnitude of the global error.

Table 6.27  Example of the RKF method

    x      y_h(x)      Y(x) - y_h(x)   Y(x) - y_2h(x)   Ratio   (1/15)[y_h(x) - y_2h(x)]
   2.0   .40000881      -8.8E-6          -5.0E-4          57       -3.3E-5
   4.0   .23529469      -5.8E-7          -4.0E-5          69       -2.6E-6
   6.0   .16216226      -9.5E-8          -7.9E-6          83       -5.2E-7
   8.0   .12307695      -2.6E-8          -2.5E-6          95       -1.6E-7
  10.0   .09900991      -9.4E-9          -1.0E-6         106       -6.6E-8

The method (6.10.41) and (6.10.42) is generally used with local extrapolation,
as is illustrated later. The method has been much studied, to see whether
improvements were possible. Recently, Shampine (1986) has given an analysis
that suggests some improved RKF formulas, based on several criteria for
comparing Runge-Kutta formulas. To date, these have not been made a part of a
high-quality production computer code, although it is expected they will be.


Automatic Runge-Kutta-Fehlberg programs   A variable-stepsize RKF program
can be written by using (6.10.43) to estimate and control the truncation error in
the fourth-order formula (6.10.41). Such a method has been written by L.
Shampine and H. Watts, and it is described in Shampine and Watts (1976a). Its
general features are as follows. The program is named RKF45, and a user of the
program must specify two error parameters ABSERR and RELERR. The truncation
error in (6.10.43) for $y_{n+1}$ is forced to satisfy the error per step criterion

$$|\text{trunc}_j| \le \text{ABSERR} + \text{RELERR} \cdot |y_{n,j}| \tag{6.10.44}$$

for each component $y_{n,j}$ of the computed solution $y_n$ of the system of differential
equations being solved. But then the final result of the computation, to be used in
further calculations, is taken to be the fifth-order formula $\hat{y}_{n+1}$ from (6.10.42)
rather than the fourth-order formula $y_{n+1}$. The terms $y_n$ on the right side in
(6.10.41) and (6.10.42) should therefore be replaced by $\hat{y}_n$. Thus RKF45 is really
a fifth-order method, which chooses the stepsize h by controlling the truncation
error in the fourth-order formula (6.10.41). As before, this is called local
extrapolation, and it can be shown that the bound (6.10.44) for $y_{n+1}$ will imply
that $\hat{y}_{n+1}$ satisfies a modified error per unit stepsize bound on its truncation
error. The argument is similar to that given in Shampine and Gordon (1975, p.
100) for variable-order Adams methods.

The tests of Shampine et al. (1976) show that RKF45 is a superior RK 
program, one that is an excellent candidate for inclusion in a library of programs 
for solving ordinary differential equations. It has become widely used, and is 
given in several texts [e.g., Forsythe et al. (1977), p. 129]. 
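To make the general features just described concrete, here is a small sketch of
a variable-stepsize driver for a single equation. It is not the RKF45 program:
it only imitates the error per step test (6.10.44) and the use of the
fifth-order value for further calculation. The stepsize update rule (a safety
factor of 0.9, with the change limited to the range 0.1 to 4) is a common
heuristic assumed here, `abserr` is assumed positive, and the name
`rkf45_sketch` and the logistic test problem are illustrative choices.

```python
import math

# Coefficients of the RKF pair (Tables 6.25 and 6.26), as floats.
A = [0.0, 1 / 4, 3 / 8, 12 / 13, 1.0, 1 / 2]
B = [[], [1 / 4], [3 / 32, 9 / 32],
     [1932 / 2197, -7200 / 2197, 7296 / 2197],
     [439 / 216, -8.0, 3680 / 513, -845 / 4104],
     [-8 / 27, 2.0, -3544 / 2565, 1859 / 4104, -11 / 40]]
G5 = [16 / 135, 0.0, 6656 / 12825, 28561 / 56430, -9 / 50, 2 / 55]
C = [1 / 360, 0.0, -128 / 4275, -2197 / 75240, 1 / 50, 2 / 55]


def rkf45_sketch(f, x0, y0, x_end, abserr, relerr, h=0.1):
    """Integrate y' = f(x, y) to x_end; return (y, number of f evaluations)."""
    x, y, nfe = x0, y0, 0
    while x_end - x > 1e-12 * max(1.0, abs(x_end)):
        h = min(h, x_end - x)
        v = []
        for j in range(6):
            stage = y + h * sum(b * vi for b, vi in zip(B[j], v))
            v.append(f(x + A[j] * h, stage))
        nfe += 6
        trunc = h * sum(c * vj for c, vj in zip(C, v))      # (6.10.43)
        tol = abserr + relerr * abs(y)                       # (6.10.44)
        if abs(trunc) <= tol:
            # Local extrapolation: continue with the fifth-order value.
            y += h * sum(g * vj for g, vj in zip(G5, v))
            x += h
        # Heuristic stepsize update (an assumption, not the RKF45 rule).
        scale = 0.9 * (tol / abs(trunc)) ** 0.2 if trunc != 0.0 else 4.0
        h *= min(4.0, max(0.1, scale))
    return y, nfe


def logistic(x, y):
    return 0.25 * y * (1.0 - y / 20.0)


y_20, nfe = rkf45_sketch(logistic, 0.0, 1.0, 20.0, abserr=1e-6, relerr=1e-12)
print(y_20 - 20.0 / (1.0 + 19.0 * math.exp(-5.0)), nfe)
```

A rejected step costs six evaluations without advancing x, which is why the
stepsize update errs on the side of slightly smaller steps.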

In general, the comparisons given in Enright and Hull (1976) have shown 
RKF methods to be superior to other RK methods. Comparisons with multistep 
methods are more difficult. Multistep methods require fewer evaluations of the 
derivative f(x, y) than RK methods, but the overhead costs per step are much 
greater with multistep than with RK methods. A judgment as to which kind of 
method to use depends on how costly it is to evaluate f(x, y) as compared with 
the overhead costs in the multistep methods. There are other considerations, for 
example, the size of the system of differential equations being solved. A general 
discussion of these factors and their influence is given in Enright and Hull (1976)
and Shampine et al. (1976). 


Example  Solve the problem

$$y' = \frac{y}{4}\left(1 - \frac{y}{20}\right), \qquad y(0) = 1$$

which has the solution

$$Y(x) = \frac{20}{1 + 19e^{-x/4}}$$

The problem was solved using RKF45, and values were output at x =
2, 4, 6, ..., 20. Three values of ABSERR were used, and RELERR = $10^{-12}$ in all
cases. The resulting global errors are given in Table 6.28. The column headed by
NFE gives the number of evaluations of f(x, y) needed to obtain the given
answer. Compare the results with those given in Table 6.15, obtained using the
variable-order multistep program DDEABM.

Table 6.28  Example of Runge-Kutta-Fehlberg program RKF45

          ABSERR = 10^-3       ABSERR = 10^-6       ABSERR = 10^-9
    x      Error     NFE        Error     NFE        Error      NFE
   4.0    -1.4E-5     19       -2.2E-7     43       -5.5E-10    121
   8.0    -3.4E-5     31       -5.6E-7     79       -1.3E-9     229
  12.0    -3.7E-5     43       -5.5E-7    103       -8.6E-10    312
  16.0    -3.3E-5     55       -5.4E-7    127       -1.2E-9     395
  20.0    -1.8E-6     67       -1.6E-7    163       -4.3E-10    503


A version of RKF45 is available that estimates the global error in the 
computed solution, using Richardson error estimation as in (6.10.24). For a 
discussion of this code, called GERK, see Shampine and Watts (1976b), which 
also contains a discussion of the general problem of global error estimation. 


Implicit Runge-Kutta methods   We have considered only the explicit RK
methods, since these are the main methods that are in current use. But there are
implicit RK formulas. Generalizing the explicit formula (6.10.19) and (6.10.20),
we consider

$$y_{n+1} = y_n + hF(x_n, y_n, h; f) \tag{6.10.45}$$

$$F(x, y, h; f) = \sum_{j=1}^{p}\gamma_j V_j$$

$$V_j = f\left(x + \alpha_j h,\; y + h\sum_{i=1}^{p}\beta_{ji}V_i\right), \qquad j = 1, \ldots, p \tag{6.10.46}$$

The coefficients $\{\alpha_j, \beta_{ji}, \gamma_j\}$ specify the method. For an explicit method, we have
$\beta_{ji} = 0$ for $i \ge j$.
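As a small illustration of an implicit formula, consider the one-stage case
p = 1 with $\alpha_1 = \beta_{11} = 1/2$ and $\gamma_1 = 1$, usually called the
implicit midpoint method. The stage value $V_1$ is then defined by an equation
that must be solved at every step. The sketch below solves it by simple
fixed-point iteration, which is an assumption made for brevity: it converges
only when h times the Lipschitz constant of f is small, and Newton's method
would normally replace it for stiff problems. The function name is
illustrative.

```python
def implicit_midpoint_step(f, x, y, h, tol=1e-14, max_iter=100):
    """One step of (6.10.45)-(6.10.46) with p = 1, alpha = beta = 1/2, gamma = 1."""
    v = f(x, y)                        # starting guess for the stage value V_1
    for _ in range(max_iter):
        # V_1 must satisfy V_1 = f(x + h/2, y + (h/2) V_1).
        v_new = f(x + 0.5 * h, y + 0.5 * h * v)
        converged = abs(v_new - v) <= tol * max(1.0, abs(v_new))
        v = v_new
        if converged:
            break
    return y + h * v


# For y' = -y the stage equation can be solved by hand, giving
# y_{n+1} = y_n (1 - h/2) / (1 + h/2).
y1 = implicit_midpoint_step(lambda x, y: -y, 0.0, 1.0, 0.1)
print(y1, 0.95 / 1.05)
```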
The implicit methods have been studied extensively in recent years, since some 
of them possess stability properties favorable for solving stiff differential
equations. For any order of convergence, there are A-stable methods of that order. In
this way, the implicit RK methods are superior to the multistep methods, for 
which there are no A-stable methods of order greater than 2. The most widely 
used codes for solving stiff differential equations, at present, are based on the 
backward differentiation formulas introduced in Section 6.9. But there is much 
work on developing similar codes based on implicit RK methods. For an 
introductory survey of this topic, see Aiken (1985, Sec. 3.1). 


6.11 Boundary Value Problems 


Up to this point, we have considered numerical methods for solving only initial 
value problems for differential equations. For such problems, conditions on the 
solution of the differential equation are specified at a single point, called the 
initial point. We now consider problems in which the unknown solution has 
conditions imposed on it at more than one point. Such problems for differential 
equations are called boundary value problems, or BVPs for short. 
A typical problem that we study is the two-point BVP:

$$\begin{aligned}
& y'' = f(x, y, y'), \qquad a < x < b \\
& A\begin{bmatrix} y(a) \\ y'(a) \end{bmatrix} + B\begin{bmatrix} y(b) \\ y'(b) \end{bmatrix} = \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix}
\end{aligned} \tag{6.11.1}$$

The terms A and B denote given square matrices of order 2 × 2, and $\gamma_1$ and $\gamma_2$
are given constants. The theory for BVPs such as this one is much more
complicated than for the initial value problem (see Theorems 6.1 and 6.2 in
Section 6.1). To illustrate the possible difficulties, we give the following examples.


Example 1. Consider the two-point linear BVP

$$\begin{aligned}
& y'' = -\lambda y, \qquad 0 < x < 1 \\
& y(0) = y(1) = 0
\end{aligned} \tag{6.11.2}$$

If $\lambda$ is not one of the numbers

$$\pi^2, 4\pi^2, 9\pi^2, \ldots, n^2\pi^2, \ldots \tag{6.11.3}$$

then this BVP has the unique solution $Y(x) \equiv 0$. Otherwise, there are an infinite
number of solutions,

$$Y(x) = C\sin\left(\sqrt{\lambda}\,x\right) \tag{6.11.4}$$

with $\lambda$ chosen from (6.11.3) and C an arbitrary constant.

2. Consider the related problem

$$\begin{aligned}
& y'' = -\lambda y + g(x), \qquad 0 < x < 1 \\
& y(0) = y(1) = 0
\end{aligned} \tag{6.11.5}$$

If $\lambda$ is not chosen from (6.11.3), and if g(x) is continuous for $0 \le x \le 1$, then
this problem has a unique solution Y(x) that is twice continuously differentiable
on [0, 1]. In contrast, if $\lambda = \pi^2$, then the problem (6.11.5) has a solution if and
only if g(x) satisfies

$$\int_0^1 g(x)\sin(\pi x)\,dx = 0$$

In the case that this is satisfied, the solution is given by

$$Y(x) = C\sin(\pi x) + \frac{1}{\pi}\int_0^x g(t)\sin(\pi(x - t))\,dt \tag{6.11.6}$$

with C an arbitrary constant. A similar result holds for other $\lambda$ chosen from
(6.11.3).



A few results from the theory of two-point BVPs are now given, to help in
further understanding them and to aid in the presentation of numerical methods
for their solution. We begin with the two-point problem for the second-order
linear equation:

$$\begin{aligned}
& y'' = p(x)y' + q(x)y + g(x), \qquad a < x < b \\
& A\begin{bmatrix} y(a) \\ y'(a) \end{bmatrix} + B\begin{bmatrix} y(b) \\ y'(b) \end{bmatrix} = \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix}
\end{aligned} \tag{6.11.7}$$

The homogeneous problem is the case in which $g(x) \equiv 0$ and $\gamma_1 = \gamma_2 = 0$.

Theorem 6.10  The nonhomogeneous problem (6.11.7) has a unique solution
Y(x) on [a, b], for each set of given data $\{g(x), \gamma_1, \gamma_2\}$, if and
only if the homogeneous problem has only the trivial solution
$Y(x) \equiv 0$.

Proof  See Stakgold (1979, p. 197). The preceding examples are illustrations of
the theorem. ∎


For conditions under which the homogeneous problem for (6.11.7) has only
the zero solution, we consider the following more special linear problem:

$$\begin{aligned}
& y'' = p(x)y' + q(x)y + g(x), \qquad a < x < b \\
& a_0 y(a) - a_1 y'(a) = \gamma_1, \qquad b_0 y(b) + b_1 y'(b) = \gamma_2
\end{aligned} \tag{6.11.8}$$

In this problem, we say the boundary conditions are separated. Assume the
following conditions are satisfied:

$$\begin{aligned}
& q(x) > 0, \qquad a \le x \le b \\
& a_0 a_1 \ge 0, \qquad b_0 b_1 \ge 0 \\
& |a_0| + |a_1| \ne 0, \qquad |b_0| + |b_1| \ne 0, \qquad |a_0| + |b_0| \ne 0
\end{aligned} \tag{6.11.9}$$

Then the homogeneous problem for (6.11.8) has only the zero solution, Theorem
6.10 is applicable, and the nonhomogeneous problem has a unique solution, for
each set of data $\{g(x), \gamma_1, \gamma_2\}$. For a proof of this result, see Keller (1968, p. 11).
Example (6.11.5) illustrates this result. It also shows that the conditions (6.11.9)
are not necessary; the problem (6.11.5) is uniquely solvable for most negative
choices of $q(x) \equiv -\lambda$.

The theory for the nonlinear problem (6.11.1) is far more complicated than
that for the linear problem (6.11.7). We give an introduction to that theory for
the following more limited problem:

$$\begin{aligned}
& y'' = f(x, y, y'), \qquad a < x < b \\
& a_0 y(a) - a_1 y'(a) = \gamma_1, \qquad b_0 y(b) + b_1 y'(b) = \gamma_2
\end{aligned} \tag{6.11.10}$$

The function f is assumed to satisfy the following Lipschitz condition:

$$\begin{aligned}
|f(x, u_1, v) - f(x, u_2, v)| &\le K|u_1 - u_2| \\
|f(x, u, v_1) - f(x, u, v_2)| &\le K|v_1 - v_2|
\end{aligned} \tag{6.11.11}$$

for all points $(x, u_i, v)$, $(x, u, v_i)$ in the region

$$R = \{(x, u, v) \mid a \le x \le b,\; -\infty < u, v < \infty\}$$

This is far stronger than needed, but it simplifies the statement of the following
theorem and the analysis of the numerical methods given later.


Theorem 6.11  For the problem (6.11.10), assume f(x, u, v) is continuous on the
region R and that it satisfies the Lipschitz condition (6.11.11). In
addition, assume that on R, f satisfies

$$\frac{\partial f(x, u, v)}{\partial u} > 0, \qquad \left|\frac{\partial f(x, u, v)}{\partial v}\right| \le M \tag{6.11.12}$$

for some constant M > 0. For the boundary conditions of
(6.11.10), assume

$$\begin{aligned}
& a_0 a_1 \ge 0, \qquad b_0 b_1 \ge 0 \\
& |a_0| + |a_1| \ne 0, \qquad |b_0| + |b_1| \ne 0, \qquad |a_0| + |b_0| \ne 0
\end{aligned} \tag{6.11.13}$$

Then the BVP (6.11.10) has a unique solution.

Proof  See Keller (1968, p. 9). The earlier uniqueness result for the linear
problem (6.11.8) is a special case of this theorem. ∎


Nonlinear BVPs may be nonuniquely solvable, with only a finite number of 
solutions. This is in contrast to the situation for linear problems, in which 
nonuniqueness always means an infinity of solutions, as illustrated in 
(6.11.2)-(6.11.6). An example of such nonunique solvability is the second-order 
problem 


$$\begin{aligned}
& y'' + \lambda\sin(y) = 0, \qquad 0 < x < 1 \\
& y'(0) = y'(1) = 0, \qquad |y(x)| < \pi
\end{aligned} \tag{6.11.14}$$


which arises in studying the buckling of a column. The parameter A is propor- 
tional to the load on the column; when A exceeds a certain size, there is a 
solution to the problem (6.11.14) other than the zero solution. For further detail 
on this problem, see Keller and Antman (1969, p. 43). 

As with the earlier material on initial value problems [see (6.1.11)-(6.1.15) in
Section 6.1], all boundary value problems for higher order equations can be
reformulated as problems for systems of first-order equations. The general form
of a two-point BVP for a system of first-order equations is

$$\begin{aligned}
& \mathbf{y}' = \mathbf{f}(x, \mathbf{y}), \qquad a < x < b \\
& A\mathbf{y}(a) + B\mathbf{y}(b) = \boldsymbol{\gamma}
\end{aligned} \tag{6.11.15}$$

This represents a system of n first-order equations. The quantities $\mathbf{y}(x)$, $\mathbf{f}(x, \mathbf{y})$,
and $\boldsymbol{\gamma}$ are vectors with n components, and A and B are matrices of order n × n.
There is a theory for such BVPs, analogous to that for the two-point problem
(6.11.1), but for reasons of space, we omit it here.

In the remainder of this section, we describe briefly the principal numerical 
methods for solving the two-point BVP (6.11.1). These methods generalize to 
first-order systems such as (6.11.15), but for reasons of space, we omit those 
results. Much of our presentation follows Keller (1968), and a theory for 
first-order systems is given there. Unlike the situation with initial value problems, 
it is often advantageous to directly treat higher order BVPs rather than to 
numerically solve their reformulation as a first-order system. The numerical 
methods for the two-point problem are also less complicated to present, and 
therefore we have opted to discuss the two-point problem rather than the system 
(6.11.15). 


Shooting methods   One of the very popular approaches to solving a two-point
BVP is to reduce it to a problem in which a program for solving initial value
problems can be used. We now develop such a method for the BVP (6.11.10).
Consider the initial value problem

$$\begin{aligned}
& y'' = f(x, y, y'), \qquad a < x < b \\
& y(a) = a_1 s - c_1\gamma_1, \qquad y'(a) = a_0 s - c_0\gamma_1
\end{aligned} \tag{6.11.16}$$

depending on the parameter s, where $c_0, c_1$ are arbitrary constants satisfying

$$a_1 c_0 - a_0 c_1 = 1$$

Denote the solution of (6.11.16) by Y(x; s). Then it is straightforward to see that

$$a_0 Y(a; s) - a_1 Y'(a; s) = \gamma_1$$

for all s for which Y exists.

Since Y is a solution of (6.11.1), all that is needed for it to be a solution of
(6.11.10) is to have it satisfy the remaining boundary condition at b. This means
that Y(x; s) must satisfy

$$\varphi(s) \equiv b_0 Y(b; s) + b_1 Y'(b; s) - \gamma_2 = 0 \tag{6.11.17}$$

This is a nonlinear equation for s. If $s^*$ is a root of $\varphi(s)$, then $Y(x; s^*)$ will satisfy
the BVP (6.11.10). It can be shown that under suitable assumptions on f and its
boundary conditions, the equation (6.11.17) will have a unique solution $s^*$ [see
Keller (1968), p. 9]. We can use a rootfinding method for nonlinear equations to
solve for $s^*$. This way of finding a solution to a BVP is called a shooting method.
The name comes from ballistics, in which one attempts to determine the needed
initial conditions at x = a in order to obtain a certain value at x = b.

Any of the rootfinding methods of Chapter 2 can be applied to solving
$\varphi(s) = 0$. Each evaluation of $\varphi(s)$ involves the solution of the initial value
problem (6.11.16) over $[a, b]$, and consequently, we want to minimize the number
of such evaluations. As a specific example of an important and rapidly convergent
method, we look at Newton's method:

\[
s_{m+1} = s_m - \frac{\varphi(s_m)}{\varphi'(s_m)}, \qquad m = 0, 1, \ldots \tag{6.11.18}
\]


To calculate $\varphi'(s)$, differentiate the definition (6.11.17) to obtain

\[
\varphi'(s) = b_0 \xi_s(b) + b_1 \xi_s'(b) \tag{6.11.19}
\]
where
\[
\xi_s(x) = \frac{\partial Y(x; s)}{\partial s} \tag{6.11.20}
\]

To find $\xi_s(x)$, differentiate the equation
\[
Y''(x; s) = f(x, Y(x; s), Y'(x; s))
\]
with respect to $s$. Then $\xi_s$ satisfies the initial value problem
\[
\xi_s''(x) = f_2(x, Y(x; s), Y'(x; s))\,\xi_s(x) + f_3(x, Y(x; s), Y'(x; s))\,\xi_s'(x) \tag{6.11.21}
\]
\[
\xi_s(a) = a_1, \qquad \xi_s'(a) = a_0
\]


The functions $f_2$ and $f_3$ denote the partial derivatives of $f(x, u, v)$ with respect
to $u$ and $v$, respectively. The initial values are obtained from those in (6.11.16) and
from the definition of $\xi_s$.

In practice we convert the problems (6.11.16) and (6.11.21) to a system of four
first-order equations with the unknowns $Y$, $Y'$, $\xi_s$, and $\xi_s'$. This system is solved
numerically, say with a method of order $p$ and stepsize $h$. Let $y_h(x; s)$ denote
the approximation to $Y(x; s)$, with a similar notation for the remaining unknowns.
From earlier results for solving initial value problems, it can be shown
that these approximate solutions will be in error by $O(h^p)$. With suitable
assumptions on the original problem (6.11.10), it can then be shown that the root
$s_h^*$ obtained will also be in error by $O(h^p)$, and similarly for the approximate
solution $y_h(x; s_h^*)$ when compared to the solution $Y(x; s^*)$ of the boundary value
problem. For the details of this analysis, see Keller (1968, pp. 47-54).


Example Consider the two-point BVP

\[
y'' = -y + \frac{2(y')^2}{y}, \qquad -1 < x < 1 \tag{6.11.22}
\]
\[
y(-1) = y(1) = (e + e^{-1})^{-1} \doteq .324027137
\]

The true solution is $Y(x) = (e^x + e^{-x})^{-1}$. The initial value problem (6.11.16) for
the shooting method is

\[
y'' = -y + \frac{2(y')^2}{y}, \qquad -1 < x < 1 \tag{6.11.23}
\]
\[
y(-1) = (e + e^{-1})^{-1}, \qquad y'(-1) = s
\]

The associated problem (6.11.21) for $\xi_s(x)$ is

\[
\xi_s'' = \left[-1 - 2\left(\frac{y'}{y}\right)^2\right]\xi_s + 4\,\frac{y'}{y}\,\xi_s' \tag{6.11.24}
\]
\[
\xi_s(-1) = 0, \qquad \xi_s'(-1) = 1
\]

The equation for $\xi_s''$ uses the solution $Y(x; s)$ of (6.11.23). The function $\varphi(s)$ for
computing $s^*$ is given by

\[
\varphi(s) = Y(1; s) - (e + e^{-1})^{-1}
\]
For use in defining Newton's method, we have
\[
\varphi'(s) = \xi_s(1)
\]

From the true solution $Y$ of (6.11.22) and the condition $y'(-1) = s$ in (6.11.23),
the desired root $s^*$ of $\varphi(s)$ is simply

\[
s^* = Y'(-1) = \frac{e - e^{-1}}{(e + e^{-1})^2} \doteq .246777174
\]

To solve the initial value problem (6.11.23)-(6.11.24), we used the second-order
Runge-Kutta method (6.10.9) with a stepsize of $h = 2/n$. The results for several
values of $n$ are given in Table 6.29. The numerical solution of (6.11.23) is denoted by
$y_h(x; s)$, and the resulting root for

\[
\varphi_h(s) \equiv y_h(1; s) - (e + e^{-1})^{-1} = 0
\]

is denoted by $s_h^*$.

Table 6.29 Shooting method for solving (6.11.22)

n = 2/h    s* - s_h*    Ratio    E_h        Ratio
 4         4.01E-3               2.83E-2
 8         1.52E-3      2.64     7.30E-3    3.88
16         4.64E-4      3.28     1.82E-3    4.01
32         1.27E-4      3.64     4.54E-4    4.01
64         3.34E-5      3.82     1.14E-4    4.00

For the error in $y_h(x; s_h^*)$, let
\[
E_h = \max_{0 \le i \le n} |Y(x_i) - y_h(x_i; s_h^*)|
\]

where $\{x_i\}$ are the node points used in solving the initial value problem. The
columns labeled Ratio give the factors by which the errors decreased when $n$ was
doubled (or $h$ was halved). Theoretically these factors should approach 4 since
the Runge-Kutta method has an error of $O(h^2)$. Empirically, the factors
approach 4.0, as expected. For the Newton iteration (6.11.18), $s_0 = .2$ was used
in each case. The iteration was terminated when the test


\[
|s_{m+1} - s_m| \le 10^{-10}
\]


was satisfied. With these choices, the Newton method needed six iterations in
each case, except that of $n = 4$ (when seven iterations were needed). However, if
$s_0 = 0$ was used, then 25 iterations were needed for the $n = 4$ case, showing the
importance of a good choice of the initial guess $s_0$.
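
The shooting iteration of this example can be sketched in a few lines of Python. This is an illustrative sketch, not the program used to produce Table 6.29: it assumes Heun's method as the second-order Runge-Kutta method (the formula (6.10.9) of the text may be a different second-order variant), and the names `rhs`, `integrate`, and `shoot` are hypothetical.

```python
import math

GAMMA = 1.0 / (math.e + 1.0 / math.e)   # boundary value (e + 1/e)^(-1)

def rhs(x, u):
    """First-order form of (6.11.23)-(6.11.24); u = [Y, Y', xi, xi']."""
    y, yp, xi, xip = u
    f2 = -1.0 - 2.0 * (yp / y) ** 2      # df/du for f = -u + 2 v^2 / u
    f3 = 4.0 * yp / y                    # df/dv
    return [yp, -y + 2.0 * yp * yp / y, xip, f2 * xi + f3 * xip]

def integrate(s, n):
    """Heun's method (assumed RK2 variant) on [-1, 1] with h = 2/n.

    Returns the final state [Y, Y', xi, xi'] at x = 1 and the list of
    approximations to Y at the nodes x_i = -1 + i*h.
    """
    h = 2.0 / n
    x = -1.0
    u = [GAMMA, s, 0.0, 1.0]             # initial values from (6.11.23)-(6.11.24)
    ys = [u[0]]
    for _ in range(n):
        k1 = rhs(x, u)
        k2 = rhs(x + h, [ui + h * ki for ui, ki in zip(u, k1)])
        u = [ui + 0.5 * h * (a + b) for ui, a, b in zip(u, k1, k2)]
        x += h
        ys.append(u[0])
    return u, ys

def shoot(n, s0=0.2, tol=1e-10, max_iter=50):
    """Newton iteration (6.11.18) applied to phi_h(s) = y_h(1; s) - GAMMA."""
    s = s0
    for _ in range(max_iter):
        u, _ = integrate(s, n)
        phi, dphi = u[0] - GAMMA, u[2]   # phi_h(s) and its derivative xi_h(1)
        s_new = s - phi / dphi
        if abs(s_new - s) <= tol:
            return s_new
        s = s
        s = s_new
    raise RuntimeError("Newton iteration did not converge")
```

Calling `shoot(n)` for n = 4, 8, 16, ... and comparing with $s^* = (e - e^{-1})/(e + e^{-1})^2$ gives a way to check the second-order behavior described above; the exact error values depend on which second-order Runge-Kutta formula is used.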


There are a number of problems that can arise with the shooting method.
First, there is no general guess $s_0$ for the Newton iteration, and with a poor
choice, the iteration may diverge. For this reason, the modified Newton method
of (2.11.11)-(2.11.12) in Section 2.11 may be needed to force convergence. A
second problem is that the computed solution $y_h(x; s)$ may be very sensitive to $h$, $s$, and
other characteristics of the boundary value problem. For example, if the linearization
of the initial value problem (6.11.16) has large positive eigenvalues, then
the solution $Y(x; s)$ is likely to be sensitive to variations in $s$. For a thorough
discussion of these and other problems, see Keller (1968, Chap. 2), Stoer and
Bulirsch (1980, Sec. 7.3), and Fox (1980, pp. 184-186). Some of these problems
are more easily examined for linear BVPs, such as (6.11.8), as is done in Keller
(1968, Chap. 2).
Finite difference methods We consider the two-point BVP

\[
y'' = f(x, y, y'), \qquad a < x < b \tag{6.11.25}
\]
\[
y(a) = \gamma_1, \qquad y(b) = \gamma_2
\]

with the true solution denoted by $Y(x)$. Let $n > 1$, $h = (b - a)/n$, $x_j = a + jh$
for $j = 0, 1, \ldots, n$. At each interior node point $x_i$, $0 < i < n$, approximate
$Y''(x_i)$ and $Y'(x_i)$:

\[
Y''(x_i) = \frac{Y(x_{i+1}) - 2Y(x_i) + Y(x_{i-1})}{h^2} - \frac{h^2}{12}\,Y^{(4)}(\xi_i)
\]
\[
Y'(x_i) = \frac{Y(x_{i+1}) - Y(x_{i-1})}{2h} - \frac{h^2}{6}\,Y^{(3)}(\eta_i) \tag{6.11.26}
\]

for some $x_{i-1} \le \xi_i, \eta_i \le x_{i+1}$, $i = 1, \ldots, n - 1$. Dropping the final error terms
and using these approximations in the differential equation, we have the approximating
nonlinear system:

\[
\frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} = f\left(x_i, y_i, \frac{y_{i+1} - y_{i-1}}{2h}\right), \qquad i = 1, \ldots, n - 1 \tag{6.11.27}
\]

This is a system of $n - 1$ nonlinear equations in the $n - 1$ unknowns $y_1, \ldots, y_{n-1}$.
The values $y_0 = \gamma_1$ and $y_n = \gamma_2$ are known from the boundary conditions.

The analysis of the error in $\{y_i\}$ compared to $\{Y(x_i)\}$ is too complicated to be
given here, as it requires the methods of analyzing the solvability of systems of
nonlinear equations. In essence, if $Y(x)$ is four times differentiable, if the
problem (6.11.25) is uniquely solvable for some region about the graph on $[a, b]$
of $Y(x)$, and if $f(x, u, v)$ is sufficiently differentiable, then there is a solution to
(6.11.27) and it satisfies

\[
\max_{0 \le i \le n} |Y(x_i) - y_i| = O(h^2) \tag{6.11.28}
\]

For an analysis, see Keller (1976, Sec. 3.2) or Keller (1968, Sec. 3.2). Moreover,
with additional assumptions on $f$ and the smoothness of $Y$, it can be shown that

\[
Y(x_i) - y_i = T(x_i)h^2 + O(h^4) \tag{6.11.29}
\]

with $T(x)$ independent of $h$. This can be used to justify Richardson extrapolation,
to obtain results that converge more rapidly. [Other methods to improve
convergence are based on correcting for the error in the central difference
approximations of (6.11.27).]
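
As a brief worked step, Richardson extrapolation follows from (6.11.29) by comparing the solutions on two meshes. Write $T(x)$ for the $h$-independent coefficient function in (6.11.29), and let $y_h$ and $y_{2h}$ denote the finite difference solutions with stepsizes $h$ and $2h$ (notation assumed here for illustration). At any node $x_i$ common to both meshes,

\[
Y(x_i) - y_h(x_i) = T(x_i)h^2 + O(h^4), \qquad Y(x_i) - y_{2h}(x_i) = 4T(x_i)h^2 + O(h^4)
\]

Multiplying the first relation by 4 and subtracting the second eliminates the $h^2$ term:

\[
Y(x_i) = \frac{4y_h(x_i) - y_{2h}(x_i)}{3} + O(h^4), \qquad Y(x_i) - y_h(x_i) \approx \frac{1}{3}\left[y_h(x_i) - y_{2h}(x_i)\right]
\]

The second relation gives a computable estimate of the error in $y_h$.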

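The system (6.11.27) can be solved in a variety of ways, some of which are
simple modifications of the methods described in Sections 2.10 and 2.11. In
matrix form, we have

\[
\frac{1}{h^2}
\begin{bmatrix}
-2 & 1 & 0 & \cdots & 0 \\
1 & -2 & 1 & & \vdots \\
 & \ddots & \ddots & \ddots & \\
\vdots & & 1 & -2 & 1 \\
0 & \cdots & 0 & 1 & -2
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-2} \\ y_{n-1} \end{bmatrix}
=
\begin{bmatrix}
f\left(x_1, y_1, \frac{y_2 - y_0}{2h}\right) \\
f\left(x_2, y_2, \frac{y_3 - y_1}{2h}\right) \\
\vdots \\
f\left(x_{n-2}, y_{n-2}, \frac{y_{n-1} - y_{n-3}}{2h}\right) \\
f\left(x_{n-1}, y_{n-1}, \frac{y_n - y_{n-2}}{2h}\right)
\end{bmatrix}
+
\begin{bmatrix} -\gamma_1/h^2 \\ 0 \\ \vdots \\ 0 \\ -\gamma_2/h^2 \end{bmatrix}
\]

which we denote by

\[
\frac{1}{h^2} A y = f(y) + g \tag{6.11.30}
\]

The matrix $A$ is nonsingular [see Theorem 8.2, Chapter 8]; and linear systems
$Az = b$ are easily solvable, as described preceding Theorem 8.2 in Chapter 8.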
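Newton's method (see Section 2.11) for solving (6.11.30) is given by

\[
y^{(m+1)} = y^{(m)} - \left[\frac{1}{h^2}A - F(y^{(m)})\right]^{-1}\left[\frac{1}{h^2}Ay^{(m)} - f(y^{(m)}) - g\right] \tag{6.11.31}
\]
\[
F(y) = \left[\frac{\partial f_i}{\partial y_j}\right], \qquad 1 \le i, j \le n - 1
\]

The Jacobian matrix simplifies considerably because of the special form of $f(y)$:

\[
[F(y)]_{ij} = \frac{\partial}{\partial y_j}\, f\left(x_i, y_i, \frac{y_{i+1} - y_{i-1}}{2h}\right)
\]

This is zero unless $j = i - 1$, $i$, or $i + 1$:

\[
[F(y)]_{ii} = f_2\left(x_i, y_i, \frac{y_{i+1} - y_{i-1}}{2h}\right), \qquad 1 \le i \le n - 1
\]
\[
[F(y)]_{i,i-1} = -\frac{1}{2h}\, f_3\left(x_i, y_i, \frac{y_{i+1} - y_{i-1}}{2h}\right), \qquad 2 \le i \le n - 1
\]
\[
[F(y)]_{i,i+1} = \frac{1}{2h}\, f_3\left(x_i, y_i, \frac{y_{i+1} - y_{i-1}}{2h}\right), \qquad 1 \le i \le n - 2
\]

with $f_2(x, u, v)$ and $f_3(x, u, v)$ denoting partial derivatives with respect to $u$ and
$v$, respectively. Thus the matrix being inverted in (6.11.31) is of the special form
we call tridiagonal. Letting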


\[
B_m = \frac{1}{h^2}A - F(y^{(m)}) \tag{6.11.32}
\]

we can rewrite (6.11.31) as

\[
y^{(m+1)} = y^{(m)} - \delta^{(m)}
\]
\[
B_m \delta^{(m)} = \frac{1}{h^2}Ay^{(m)} - f(y^{(m)}) - g \tag{6.11.33}
\]


This linear system is easily and rapidly solvable, as shown in Section 8.2 of
Chapter 8. The number of multiplications and divisions can be shown to be about
$5n$, a relatively small number of operations for solving a linear system of $n - 1$
equations. Additional savings can be made by not varying $B_m$ or by only
changing it after several iterations of (6.11.33). For an extensive survey and
discussion of the solution of nonlinear systems that arise in connection with
solving BVPs, see Deuflhard (1979).
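
To make the operation count concrete, the following Python sketch solves a tridiagonal system by Gaussian elimination without pivoting. It is a minimal illustration under the assumption that no pivot vanishes; the function name `solve_tridiagonal` and its argument layout are hypothetical and are not taken from Chapter 8.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal linear system by elimination without pivoting.

    Row i of the system reads
        lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i],
    where lower[0] and upper[-1] are ignored.
    Assumes every pivot encountered is nonzero.
    """
    n = len(diag)
    c = [0.0] * n                       # scaled superdiagonal
    d = [0.0] * n                       # scaled right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):               # forward elimination
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot
    x = d[:]                            # x[n-1] = d[n-1]
    for i in range(n - 2, -1, -1):      # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Each row costs two multiplications and two divisions in the forward sweep and one multiplication in the back substitution, which is consistent with the count of about $5n$ quoted above.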


Example We applied the preceding finite difference procedure (6.11.27) to the
solution of the BVP (6.11.22), used earlier to illustrate the shooting method. The
results are given in Table 6.30 for successive doublings of $n = 2/h$. The nonlinear
system in (6.11.27) was solved using Newton's method, as described in
(6.11.33). The initial guess was

\[
y_h^{(0)}(x_i) = (e + e^{-1})^{-1}, \qquad i = 0, 1, \ldots, n
\]
based on connecting the boundary values by a straight line. The quantity

\[
d_m = \max_{0 \le i \le n} \left|y_i^{(m+1)} - y_i^{(m)}\right|
\]
was computed for each iterate, and when the condition


d, < 10-° 


was satisfied, the iteration was terminated. In all cases, the number of iterates
computed was 5 or 6. For the error, let

\[
E_h = \max_{0 \le i \le n} |Y(x_i) - y_h(x_i)|
\]

with $y_h$ the solution of (6.11.27) obtained with Newton's method. According to
(6.11.28) and (6.11.29), we should expect the values $E_h$ to decrease by about a
factor of 4 when $h$ is halved, and that is what we observe in the table.

Table 6.30 The finite difference method for solving (6.11.22)

n = 2/h    E_h        Ratio
 4         2.63E-2
 8         5.87E-3    4.48
16         1.43E-3    4.11
32         3.55E-4    4.03
64         8.86E-5    4.01
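
The Newton iteration (6.11.33) for this example can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program behind Table 6.30: the stopping tolerance `tol` is a free parameter chosen here for illustration, the tridiagonal systems are solved without pivoting, and the names `finite_difference_bvp` and `solve_tridiagonal` are hypothetical.

```python
import math

def f(x, u, v):
    return -u + 2.0 * v * v / u          # right side of (6.11.22)

def f2(x, u, v):
    return -1.0 - 2.0 * (v / u) ** 2     # partial derivative of f in u

def f3(x, u, v):
    return 4.0 * v / u                   # partial derivative of f in v

def solve_tridiagonal(lower, diag, upper, rhs):
    """Tridiagonal elimination without pivoting (assumes nonzero pivots)."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot
    x = d[:]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def finite_difference_bvp(n, tol=1e-10, max_iter=50):
    """Solve (6.11.27) for the BVP (6.11.22) by the Newton iteration (6.11.33)."""
    a, b = -1.0, 1.0
    gamma = 1.0 / (math.e + 1.0 / math.e)    # gamma_1 = gamma_2 for (6.11.22)
    h = (b - a) / n
    x = [a + i * h for i in range(n + 1)]
    y = [gamma] * (n + 1)                    # constant initial guess
    for _ in range(max_iter):
        lower, diag, upper, resid = [], [], [], []
        for i in range(1, n):
            v = (y[i + 1] - y[i - 1]) / (2.0 * h)
            fv = f3(x[i], y[i], v)
            # Row i of B_m = A/h^2 - F(y): sub-, main, and superdiagonal.
            lower.append(1.0 / h**2 + fv / (2.0 * h))
            diag.append(-2.0 / h**2 - f2(x[i], y[i], v))
            upper.append(1.0 / h**2 - fv / (2.0 * h))
            # Residual A y / h^2 - f(y) - g, with boundary values folded in.
            resid.append((y[i + 1] - 2.0 * y[i] + y[i - 1]) / h**2
                         - f(x[i], y[i], v))
        delta = solve_tridiagonal(lower, diag, upper, resid)
        for i in range(1, n):
            y[i] -= delta[i - 1]             # y^(m+1) = y^(m) - delta^(m)
        if max(abs(t) for t in delta) <= tol:
            return x, y
    raise RuntimeError("Newton iteration did not converge")
```

Comparing the returned values with $Y(x) = (e^x + e^{-x})^{-1}$ for successive doublings of `n` gives a direct check of the $O(h^2)$ behavior in (6.11.28).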


Higher order methods can be obtained in several ways. (1) Use higher order
approximations to the derivatives, improving (6.11.26); (2) use Richardson extrapolation,
based on (6.11.29); as with Romberg integration, it can be repeated
to obtain methods of arbitrarily high order; (3) the truncation errors in (6.11.26)
can be approximated with higher order differences using the calculated values of
$y_i$. Using these values as corrections to (6.11.27), we can obtain a new more
accurate approximation to the differential equation in (6.11.25), leading to a more
accurate solution. All of these techniques have been used, and some have been
implemented as quite sophisticated computer codes. For a further discussion and
for examples of computer codes, see Fox (1980, p. 191), Jain (1984, Chap. 4), and
Pereyra (1979).


Other methods and problems There are a number of other methods used for 
solving boundary value problems. The most important of these is probably the 
collocation method. For discussions referring to collocation methods, see Reddien 
(1979), Deuflhard (1979), and Ascher and Russell (1985). For an important
collocation computer code, see Ascher et al. (1981a) and (1981b). 

Another approach to solving a boundary value problem is to solve an 
equivalent reformulation as an integral equation. There is much less development 
of such numerical methods, although they can be very effective in some situa- 
tions. For an introduction to this approach, see Keller (1968, Chap. 4). 

There are also many other types of boundary value problems, some containing
some type of singular behavior, that we have not discussed here. For all of
these, see the papers in the proceedings of Ascher and Russell (1985), Aziz 
(1975), Childs et al. (1979), and Gladwell and Sayers (1980); also see Keller 
(1976, Chap. 4) for singular problems. For discussions of software, see Childs 
et al. (1979), Gladwell and Sayers (1980), and Enright (1985). 


Discussion of the Literature 


Ordinary and partial differential equations are the principal form of mathemati- 
cal model occurring in the sciences and engineering, and consequently, the 
numerical solution of differential equations is a very large area of study. Two
classical books that reflect the state of knowledge before the widespread use of 
digital computers are Collatz (1966) and Milne (1953). Some important and 
general books, since 1960, in the numerical solution of ordinary differential
equations are Henrici (1962), Gear (1971), Lapidus and Seinfeld (1971), Lambert 
(1973), Stetter (1973), Hall and Watt (1976), Shampine and Gordon (1975), 
Van der Houwen (1977), Ortega and Poole (1981), and Butcher (1987). A useful 
survey is given in Gupta et al. (1985). 

The modern theory of convergence and stability of multistep methods, intro- 
duced in Section 6.8, dates from Dahlquist (1956). An historical account is given 
in Dahlquist (1985). The text by Henrici (1962) has become a classic account of 
that theory, including extensions and applications of it. Gear (1971) is a more 
modern account of all methods, especially variable order methods. Stetter (1973) 
gives a very general and complete abstract analysis of the numerical theory for
solving initial value problems. A complete account up to 1970 of Runge-Kutta
methods, their development and error analysis, is given in Lapidus and Seinfeld
(1971). Hall and Watt (1976) gives a survey of all aspects of the solution of 
ordinary differential equations, including the many special topics that have 
become of greater interest in the past ten years. 

The first significant use of the concept of a variable order method is due to 
Gear (1971) and Krogh (1969). Such methods are superior to a fixed-order 
multistep method in efficiency, and they do not require any additional method 
for starting the integration or for changing the stepsize. A very good account of 
the variable-order Adams method is given in Shampine and Gordon (1975),
and the excellent code DE/STEP is included. Other important early codes based
on the Adams family of formulas were those in Krogh (1969), DIFSUB from 
Gear (1971), and GEAR from Hindmarsh (1974). The latter program GEAR has 
been further developed into a large multifunction package, called ODEPACK, 
and it is described in Hindmarsh (1983). Variants of these codes and other 
differential equation solvers are available in the IMSL and NAG libraries. 

Runge-Kutta methods are a continuing active area of theoretical research and 
program development, and a very general development is given in Butcher (1987). 
New methods are being developed for nonstiff problems; for example, see 
Shampine (1986) and Shampine and Baca (1986). There is also great interest in 
implicit Runge-Kutta methods, for use in solving stiff differential equations. For 
a survey of the latter, see Aiken (1985, pp. 70-92). An important competitor to 
the code RKF45 is the code DVERK described in Hull et al. (1976). It is based 
on a Fehlberg-type scheme, with a pair of formulas of orders 5 and 6. 

A third class of methods has been ignored in our presentation, those based on 
extrapolation. Current work in this area began with Gragg (1965) and Bulirsch 
and Stoer (1966). The main idea is to perform repeated extrapolation on some 
simple method, to obtain methods of increasingly higher order. In effect, this 
gives another way to produce variable-order methods. These methods have 
performed fairly well in the tests of Enright and Hull (1976) and Shampine et al. 
(1976), but they were judged to not be as advanced in their practical and 
theoretical development as are the multistep and Runge-Kutta methods. For a 
recent survey of the area, see Deuflhard (1985). Also, see Shampine and Baca 
(1986), in which extrapolation methods are discussed as one example of variable 
order Runge—Kutta methods. 

Global error estimation is an area in which comparatively little has been 
published. For a general survey, see Skeel (1986). To our knowledge, the only 
running computer code is GERK from Shampine and Watts (1976a). Additional
work is needed in this area. Many users of automatic packages are under the 
mistaken impression that the automatic codes they are using are controlling the 
global error, but such global error control is not possible in a practical sense. In 
many cases, it would seem to be important to have some idea of the actual 
magnitude of the global error present in the numerical solution. 

Boundary value problems for ordinary differential equations are another 
important topic; but both their general theory and their numerical analysis are 
much more sophisticated than for the initial value problem. Important texts are 
Keller (1968) and (1976), and the proceedings of Childs et al. (1979) gives many
important papers on producing computer codes. For additional papers, see the 
collections of Aziz (1975), Hall and Watt (1976), and Ascher and Russell (1985). 
This area is still comparatively young relative to the initial value problem for 
nonstiff equations. The development of computer codes is proceeding in a 
number of directions, and some quite good codes have been produced in recent 
years. More work has been done on codes using the shooting method, but there 
are also excellent codes being produced for collocation and finite difference 
methods. For some discussion of such codes, see Enright (1985) and Gladwell 
and Sayers (1980, pp. 273-303). Some boundary value codes are given in the 
IMSL and NAG libraries. 

Stiff differential equations is one of several special areas that have become 
much more important in the past ten years. The best general survey of this area is 
given in Aiken (1985). It gives examples of how such problems arise, the theory 
of numerical methods for solving stiff problems, and a survey of computer codes 
that exist for their solution. Many of the other texts in our bibliography also 
address the problem of stiff differential equations. We also recommend the paper 
of Shampine and Gear (1979). 

Equations with a highly oscillatory solution occur in a number of applications. 
For some discussion of this, see Aiken (1985, pp. 111-123). The method of lines 
for solving time-dependent partial differential equations is a classical procedure 
that has become more popular in recent years. It is discussed in Aiken (1985, pp. 
124-138), Sincovec and Madsen (1975), and Melgaard and Sincovec (1981). 

Yet another area of interest is the solution of mixed systems of differential and 
algebraic equations (DAEs). This refers to systems in which there are n un- 
knowns, m <n differential equations, and n — m algebraic equations, involving 
the n unknown functions. Such problems occur in many areas of applications. 
One such area of much interest in recent years is that of computer aided design 
(CAD). For papers applicable to such problems and to other DAEs, see
Rheinboldt (1984) and (1986).


Because of the creation of a number of automatic programs for solving 
differential equations, several empirical studies have been made to assess their 
performance and to make comparisons between programs. Some of the major 
comparisons are given in Enright and Hull (1976), Enright et al. (1975), and 
Shampine et al. (1976). It is clear from their work that programs must be 
compared, as well as methods. Different program implementations of the same 
method can vary widely in their performance. No similar results are known for 
comparisons of boundary value codes. 
Problems

1. Draw the direction field for $y' = x - y^2$, and then draw in some sample
solution curves. Attempt to guess the behavior of the solutions $Y(x)$ as
$x \to \infty$.

2. Determine Lipschitz constants for the following functions, as in (6.1.2). 
(a) $f(x, y) = 2y/x$, $x \ge 1$
(b) $f(x, y) = \tan^{-1}(y)$
(c) $f(x, y) = (x^3 - 2)^{27}/(17x^2 + 4)$
(d) $f(x, y) = x - y^2$, $|y| \le 10$


3. Convert the following problems to first-order systems.

(a) $y'' - 3y' + 2y = 0$, $y(0) = 1$, $y'(0) = 1$

(b) $y'' - .1(1 - y^2)y' + y = 0$, $y(0) = 1$, $y'(0) = 0$
(Van der Pol's equation)

(c) $x''(t) = -x/r^3$, $y''(t) = -y/r^3$, $r = \sqrt{x^2 + y^2}$
(orbital equation)
$x(0) = .4$, $x'(0) = 0$, $y(0) = 0$, $y'(0) = 2$


4. Let $Y(x)$ be the solution, if it exists, to the initial value problem (6.0.1). By
integrating, show that $Y$ satisfies

\[
Y(x) = Y_0 + \int_{x_0}^{x} f(t, Y(t))\,dt
\]

Conversely, show that if this equation has a continuous solution on the
interval $x_0 \le x \le b$, then the initial value problem (6.0.1) has the same
solution.


5. The integral equation of Problem 4 is solved, at least in theory, by using the
iteration

\[
Y_{m+1}(x) = Y_0 + \int_{x_0}^{x} f(t, Y_m(t))\,dt, \qquad x_0 \le x \le b
\]

for $m \ge 0$, with $Y_0(x) \equiv Y_0$. This is called Picard iteration, and under
suitable assumptions, the iterates $\{Y_m(x)\}$ can be shown to converge
uniformly to $Y(x)$. Illustrate this by computing the Picard iterates $Y_1$, $Y_2$,
and $Y_3$ for the following problems; compare them to the true solution $Y$.

(a) $y' = -y$, $y(0) = 1$
(b) $y' = -xy$, $y(1) = 2$
(c) $y' = y + 2\cos(x)$, $y(0) = 1$


6. Write a computer program to solve $y' = f(x, y)$, $y(x_0) = Y_0$, using Euler's
method. Write it to be used with an arbitrary $f$, stepsize $h$, and interval
$[x_0, b]$. Using the program, solve $y' = x^2 - y$, $y(0) = 1$, for $0 \le x \le 4$,
with stepsizes of $h = .25, .125, .0625$, in succession. For each value of $h$,
print the true solution, approximate solution, error, and relative error at the
nodes $x = 0, .25, .50, .75, \ldots, 4.00$. The true solution is $Y(x) = x^2 - 2x + 2 - e^{-x}$.
Analyze your output and supply written comments on it. Analysis
of output is as important as obtaining it.


7. For the problem $y' = y$, $y(0) = 1$, give the explicit solution $\{y_n\}$ for
Euler's approximation to the equation. Use this to show that for $x_n = 1$,
$Y(1) - y_n \approx (h/2)e$ as $h \to 0$.
8. Consider solving the problem

\[
y' = -y^2, \qquad x \ge 0, \qquad y(0) = 1
\]

by Euler's method. Compare (1) the bound (6.2.13), using $K = 2$ as the
Lipschitz constant, and (2) the asymptotic estimate from (6.2.36). The true
solution is $Y(x) = 1/(1 + x)$.


9. Show that Euler's method fails to approximate the solution $Y(x) = (\tfrac{2}{3}x)^{3/2}$,
$x \ge 0$, of the problem $y' = y^{1/3}$, $y(0) = 0$. Explain why.


10. For the equations in Problem 3, write out the approximating difference
equations obtained by using Euler's method.


11. Recall the rounding error example for Euler's method, with the results
shown in Table 6.3. Attempt to produce a similar behavior on your
computer by letting $h$ become smaller, until eventually the error begins to
increase.

12. Convert the problem

\[
y''' + 4y'' + 5y' + 2y = -4\sin(x) - 2\cos(x)
\]
\[
y(0) = 1, \qquad y'(0) = 0, \qquad y''(0) = -1
\]

to a system of first-order equations. Using Euler's method, solve this system
and empirically study the error. The true solution is $Y(x) = \cos(x)$.


13. (a) Derive the Lipschitz condition (6.2.52) for a system of two differential
equations.

(b) Prove that the method (6.2.51) will converge to the solution of
(6.2.50).


14. Consider the two-step method

\[
y_{n+1} = \tfrac{1}{2}(y_n + y_{n-1}) + \frac{h}{4}\left[4y'_{n+1} - y'_n + 3y'_{n-1}\right], \qquad n \ge 1
\]

with $y'_n \equiv f(x_n, y_n)$. Show it is a second-order method, and find the leading
term in the truncation error, written as in (6.3.15).


15. Assume that the multistep method (6.3.1) is consistent and that it satisfies
$a_j \ge 0$, $j = 0, 1, \ldots, p$, the same as in Theorem 6.6. Prove stability of
(6.3.1), in analogy with (6.2.28), for Euler's method, but letting $\delta \equiv 0$.

16. Write a program to solve $y' = f(x, y)$, $y(x_0) = Y_0$, using the midpoint rule
(6.4.2). Use a fixed stepsize $h$. For the initial value $y_1$, use the Euler
method:

\[
y_1 = y_0 + hf(x_0, y_0)
\]

With the program, solve the following problems:

(a) $y' = -y^2$, $y(0) = 1$; $Y(x) = 1/(1 + x)$
(b) $y' = \frac{y}{4}\left(1 - \frac{y}{20}\right)$, $y(0) = 1$; $Y(x) = 20/[1 + 19e^{-x/4}]$
(c) $y' = -y + 2\cos(x)$, $y(0) = 1$; $Y(x) = \cos(x) + \sin(x)$
(d) $y' = y - 2\sin(x)$, $y(0) = 1$; $Y(x) = \cos(x) + \sin(x)$

Solve on the interval $[x_0, b] = [0, 5]$, with $h = .5, .25$. Print the numerical
solution $y_n = y_h(x_n)$ at each node, along with the true error. Discuss your
results.


17. Write a program to solve $y' = f(x, y)$, $y(x_0) = Y_0$, using the trapezoidal
rule (6.5.2) with a fixed stepsize $h$. In the iteration (6.5.3), use the midpoint
method as the predictor of $y_{n+1}^{(0)}$. Allow the number of iterates $J$ to be an
input variable, $J \ge 1$. Solve the equations in Problem 16 with $h = .5$ and
$.25$, for $[x_0, b] = [0, 10]$. Solve over the entire interval with $J = 1$, then
repeat the process with $J = 2$, and then $J = 3$. Discuss the results. How
does the total error vary with $J$?


18. Derive the asymptotic error formula (6.5.19) for the trapezoidal method.

19. Write a program to implement the algorithm Detrap of Section 6.6. Using
it, solve the equations given in Problem 16, with $\epsilon = .001$.


20. Use the quadratic interpolant to $Y'(x) = f(x, Y(x))$ at $x_n, x_{n-1}, x_{n-2}$ to
obtain the formula

\[
Y(x_{n+1}) = Y(x_{n-3}) + \frac{4h}{3}\left[2Y'(x_n) - Y'(x_{n-1}) + 2Y'(x_{n-2})\right] + \frac{28}{90}h^5 Y^{(5)}(\xi_n)
\]

When the truncation error is dropped, we obtain the method

\[
y_{n+1} = y_{n-3} + \frac{4h}{3}\left[2y'_n - y'_{n-1} + 2y'_{n-2}\right], \qquad n \ge 3
\]

This is the predictor formula for the Milne method (6.7.13).


21. Show that the Simpson method (6.7.13) is only weakly stable, in the same
sense as was true of the midpoint method in Section 6.4.


22. Write a program to solve $y' = f(x, y)$, $y(x_0) = Y_0$, $x_0 \le x \le b$, using the
fourth-order Adams-Moulton formula and a fixed stepsize $h$. Use the
fourth-order Adams-Bashforth formula as the predictor. Generate
the initial values $y_1, y_2, y_3$ using the true solution $Y(x)$. Solve the equations of
Problem 16 with $h = .5$ and $h = .25$. Print the calculated answers and the
true errors. For comparison with method (6.7.27), also solve the example
used in Table 6.14. Check the accuracy of the Richardson extrapolation
error estimate

\[
Y(x) - y_h(x) \approx \tfrac{1}{15}\left[y_h(x) - y_{2h}(x)\right]
\]

that is based on the global error being of the fourth order. Discuss all your
results.


23. (a) For the coefficients $\gamma_i$ and $\delta_i$ of the Adams-Bashforth and
Adams-Moulton formulas, show that $\delta_i = \gamma_i - \gamma_{i-1}$, $i \ge 1$.

(b) For the $p$-step Adams-Moulton formula (6.7.26), prove

\[
y_{n+1} = y_{n+1}^{(0)} + h\gamma_p \nabla^{p+1} y'_{n+1} \tag{1}
\]
with $y_{n+1}^{(0)}$ the $p + 1$ step Adams-Bashforth formula from (6.7.22),
\[
y_{n+1}^{(0)} = y_n + h\sum_{j=0}^{p} \gamma_j \nabla^j y'_n \tag{2}
\]

These formulas are both of order $p + 1$. The result (1) is of use in
calculating the corrector from the predictor, and is based on carrying
certain backward differences from one step to the next. There is a
closely related result when a $p$-step predictor (order $p$) is used to
solve the $p$-step corrector (order $p + 1$) [see Shampine and Gordon
(1975), p. 51].


24. Consider the model equation (6.8.16) with A a square matrix of order m.
Assume A = P~!DP, with D a diagonal matrix with entries \,...,A,,- 
Introduce the new unknown vector function z = P~'y(x). Show that 
(6.8.16) converts to the form given in (6.8.17), demonstrating the reduction 
to the one-dimensional model equation. 


25. Following the ideas given in Sections 6.4 and 6.8, give the general procedure
for solving the linear difference equation

\[
y_{n+1} = a_0 y_n + a_1 y_{n-1}
\]


Apply this to find the general solution of the following equations. 


(a) dnt = ay + dy, 3 


(b) Yn =In — WYn—1 Hint: See the formula following (6.8.22). 


26. Solve the third-order linear difference equation

\[
u_{n+1} = u_n + c\,(u_{n-1} - u_{n-2}), \qquad n \ge 2, \qquad 0 < c < 1
\]

with $u_0, u_1, u_2$ given. What can be said about

\[
\lim_{n \to \infty} u_n
\]


27. Consider the numerical method

\[
y_{n+1} = 4y_n - 3y_{n-1} - 2hf(x_{n-1}, y_{n-1}), \qquad n \ge 1
\]

Determine its order. Illustrate with an example that the method is unstable.


28. Show that the two-step method


Yat = 2p ~ In + ALS, + dys a] 


is of order 2 and unstable. Also, show directly that it need not converge 
when solving y’ = f(x, y). 


29. Complete part (1) of the proof of Theorem 6.7, in which the root condition
is violated by assuming $|r_j| = 1$, $\rho'(r_j) = 0$, for some $j$.

30. For part (2) of the proof of Theorem 6.8, show that $\gamma_j[r_j(h\lambda)]^n \to 0$ as
$h \to 0$, $1 \le j \le p$.


31. (a) Determine the values of $a_0$ in the explicit second-order method
(6.7.10) for which the method is stable.

(b) If only the truncation error (6.7.11) is considered, subject to the
stability restriction in part (a), how should $a_0$ be chosen?

(c) To ensure a large region of stability, subject to part (a), how should $a_0$
be chosen?


32. (a) Find the general formula for all two-step third-order methods. These
will be a one-parameter family of methods, say, depending on the 
coefficient a,. 


(b) What are the restrictions on a, for the method in part (a) to be 
stable? 


(c) If the truncation error is written as 


\[
T_{n+1}(Y) = \beta h^4 Y^{(4)}(x_n) + O(h^5)
\]

give a formula for B in terms of a,. (It is not necessary to construct a 
Peano kernel or influence function for the method.) How should a, be 
chosen if the truncation error is to be minimized, subject to the 
stability restriction from part (b)? 


(d) Consider the region of absolute stability for the methods of part (a). 
What is this region for the method of part (c) that minimizes the 
truncation error coefficient 8? Give another value of a, that gives a 
Stable method and that has a larger region of absolute stability. 
Discuss finding an optimal region by choosing a, approximately. 


33. (a) Find all explicit fourth-order formulas of the form

\[
y_{n+1} = a_0 y_n + a_1 y_{n-1} + a_2 y_{n-2} + h\left[b_0 y'_n + b_1 y'_{n-1} + b_2 y'_{n-2}\right], \qquad n \ge 2
\]

(b) Show that every such method is unstable.


34. Derive an implicit fourth-order multistep method, other than those given in
the text. Make it be relatively stable. 


35. For the polynomial $\rho(r) = r^{p+1} - \sum_{j=0}^{p} a_j r^{p-j}$, assume $a_j \ge 0$, $0 \le j \le p$,
and $\sum_{j=0}^{p} a_j = 1$. Show that the roots of $\rho(r)$ will satisfy the root condition
(6.8.9) and (6.8.10). This shows directly that Theorem 6.6 is a corollary of
Theorem 6.8.


36. (a) Consider methods of the form

\[
y_{n+1} = y_{n-q} + h\sum_{j=-1}^{p} b_j f(x_{n-j}, y_{n-j})
\]

with $q \ge 1$. Show that such methods do not satisfy the strong root
condition. As a consequence, most such methods are only weakly
stable.

(b) Find an example with $q = 1$ that is relatively stable.


37. Show that the region of absolute stability for the trapezoidal method is the
set of all complex $h\lambda$ with $\mathrm{Real}(\lambda) < 0$.


38. Use the backward Euler method to solve the problem (6.8.51). Because the
equation is linear, the implicit equation for $y_{n+1}$ can be solved exactly.
Compare your results with those given in Table 6.17 for Euler's method.

39. Repeat Problem 38 using the second-order BDF formula (6.9.6). To find $y_1$
for use in (6.9.6), use the backward Euler method.


40. Recall the model equation (6.8.50) where it is regarded as a perturbation of
Euler's method (6.8.46). For the special case $Y''(x) = $ constant, analyze the
behavior of the error $Y(x_n) - y_n$ as it depends on $\lambda$. Show that again the
condition (6.8.49) is needed in order that the error be well-behaved.


41. Derive the truncation error formula (6.9.7) for backward differentiation
formulas.


42. For solving $y' = f(x, y)$, consider the numerical method

\[
y_{n+1} = y_n + \frac{h}{2}\left[y'_n + y'_{n+1}\right] + \frac{h^2}{12}\left[y''_n - y''_{n+1}\right], \qquad n \ge 0
\]

Here $y'_n = f(x_n, y_n)$,

\[
y''_n = \frac{\partial f(x_n, y_n)}{\partial x} + f(x_n, y_n)\,\frac{\partial f(x_n, y_n)}{\partial y}
\]

with this formula based on differentiating $Y'(x) = f(x, Y(x))$.

(a) Show that this is a fourth-order method: $T_n(Y) = O(h^5)$.

(b) Show that the region of absolute stability contains the entire negative
real axis of the complex $h\lambda$-plane.


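43. Generalize the method of lines, given in (6.9.23)-(6.9.25), to the problem

\[
U_t = a(x, t)U_{xx} + G(x, t, U(x, t)), \qquad 0 < x < 1, \quad t > 0
\]
\[
U(0, t) = d_0(t), \qquad U(1, t) = d_1(t), \qquad t \ge 0
\]
\[
U(x, 0) = f(x), \qquad 0 \le x \le 1
\]

For it to be well-defined, we assume $a(x, t) > 0$, $0 \le x \le 1$, $t \ge 0$.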


44. (a) If you have a solver of tridiagonal linear algebraic systems available to
you, then write a program to implement the method of lines for the
problem (6.9.19)-(6.9.21). The example in the text, with the unknown
(6.9.35), was solved using the backward Euler method. Now implement
the method of lines using the trapezoidal rule. Compare your
results with those in Table 6.20 for the backward Euler method.


(b) Repeat with the second-order BDF method.


45. Derive a third-order Taylor series method to solve $y' = -y^2$. Compare the
numerical results to those in Table 6.22.


46. Using the Taylor series method of Section 6.10, produce a fourth-order
method to solve $y' = x - y^2$, $y(0) = 0$. Use fixed stepsizes, $h = .5, .25,
.125$ in succession, and solve for $0 \le x \le 10$. Estimate the global error
using the error estimate (6.10.24) based on Richardson extrapolation.


47. Write a program to solve $y' = f(x, y)$, $y(x_0) = Y_0$, using the classical
Runge-Kutta method (6.10.21), and let the stepsize h be fixed.


(a) Using the program, solve the equations of Problem 16. 


(b) Solve $y' = x - y^2$, $y(0) = 0$, for $h = .5, .25, .125$. Compare the
results with those of Problem 46.


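48. Consider the three stage Runge-Kutta formula

$$y_{n+1} = y_n + h\left[\gamma_1 V_1 + \gamma_2 V_2 + \gamma_3 V_3\right]$$
$$V_1 = f(x_n, y_n) \qquad V_2 = f(x_n + \alpha_2 h,\; y_n + h\beta_{21} V_1)$$
$$V_3 = f(x_n + \alpha_3 h,\; y_n + h(\beta_{31} V_1 + \beta_{32} V_2))$$

Determine the set of equations that the coefficients $\{\gamma_j, \alpha_j, \beta_{ji}\}$ must satisfy
if the formula is to be of order 3. Find a particular solution of these
equations.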


49. Prove that if the Runge-Kutta method (6.10.4) satisfies (6.10.27), then it is
stable.


50. Apply the classical Runge-Kutta method (6.10.21) to the test problem
(6.8.51), for various values of $\lambda$ and h. For example, try $\lambda = -1, -10,
-50$ and $h = .5, .1, .01$, as in Table 6.17.


51. Calculate the real part of the region of absolute stability for the
Runge-Kutta method of (a) (6.10.8), (b) (6.10.9), (c) (6.10.21). We are
interested in the behavior of the numerical solution for the differential
equation $y' = \lambda y$ with $\text{Real}(\lambda) < 0$. In particular, we are interested in
those values of $h\lambda$ for which the numerical solution tends to zero as
$x_n \to \infty$.


52. (a) Using the Runge-Kutta method (6.10.8), solve

$$y' = -y + x^{.1}[1.1 + x] \qquad y(0) = 0$$

whose solution is $Y(x) = x^{1.1}$. Solve the equation on [0,5], printing
the errors at x = 1,2,3,4,5. Use stepsizes h = .1, .05, .025, .0125,
.00625. Calculate the ratios by which the errors decrease when h is
halved. How does this compare with the usual theoretical rate of
convergence of $O(h^2)$? Explain your results.


(b) What difficulty arises when trying to use a Taylor method of order
$\ge 2$ to solve the equation of part (a)? What does it tell us about the
solution?


53. Convert the boundary value problem (6.11.1) to an equivalent boundary
value problem for a system of first-order equations, as in (6.11.15).


54. (a) Consider the two-point boundary value problem (6.11.25). To convert
this to an equivalent problem with zero boundary conditions, write
$y(x) = z(x) + w(x)$, with w(x) a straight line satisfying the following
boundary conditions: $w(a) = \gamma_1$, $w(b) = \gamma_2$. Derive a new
boundary value problem for z(x).


(b) Generalize this procedure to problem (6.11.10). Obtain a new problem
with zero boundary conditions. What assumptions, if any, are needed
for the coefficients $a_0$, $a_1$, $b_0$, $b_1$?


55. Using the shooting method of Section 6.11, solve the following boundary
value problems. Study the convergence rate as h is varied.


(a) $y'' = -\frac{2}{x}\,y y'$, $1 < x < 2$; $y(1) = \frac{1}{2}$, $y(2) = \frac{2}{3}$

True solution: $Y(x) = x/(1 + x)$.


(b) $y'' = 2yy'$, $0 < x < \frac{\pi}{4}$; $y(0) = 0$, $y\left(\frac{\pi}{4}\right) = 1$

True solution: $Y(x) = \tan(x)$.


56. Investigate the differential equation programs provided by your computer
center. Note those that automatically control the truncation error by
varying the stepsize, and possibly the order. Classify the programs as
multistep (fixed-order or variable-order), Runge-Kutta, or extrapolation.
Compare one of these with the programs DDEABM [of Section 6.7 and
Shampine and Gordon (1975)] and RKF45 [of Section 6.9 and
Shampine and Watts (1976b)] by solving the problem


with desired absolute errors of $10^{-3}$, $10^{-6}$, and $10^{-9}$. Compare the results
with those given in Tables 6.15 and 6.28.


57. Consider the problem

$$y' = \frac{1}{t + 1} + c \cdot \tan^{-1}(y(t)) - \frac{1}{2} \qquad y(0) = 0$$

with c a given constant. Since $y'(0) = \frac{1}{2}$, the solution y(t) is initially
increasing as t increases, regardless of the value of c. As best you can, show
that there is a value of c, call it $c^*$, for which (1) if $c > c^*$, the solution y(t)
increases indefinitely, and (2) if $c < c^*$, then y(t) increases initially, but
then peaks and decreases. Determine $c^*$ to within .00005, and then calculate
the associated solution y(t) for $0 \le t \le 50$.


58. Consider the system 
$$x'(t) = Ax - Bxy \qquad y'(t) = Cxy - Dy$$


This is known as the Lotka-Volterra predator-prey model for two populations,
with x(t) being the number of prey and y(t) the number of
predators, at time t.


(a) Let A = 4, B = 2, C = 1, D = 3, and solve the model to at least
three significant digits for $0 \le t \le 5$. The initial values are x(0) = 3,
y(0) = 5. Plot x and y as functions of t, and plot x versus y.


(b) Solve the same model with x(0) = 3 and, in succession, y(0) = 1, 1.5, 
2. Plot x versus y in each case. What do you observe? Why would the 
point (3, 2) be called an equilibrium point? 


SEVEN 


LINEAR ALGEBRA 


The solution of systems of simultaneous linear equations and the calculation of 
the eigenvalues and eigenvectors of a matrix are two very important problems 
that arise in a wide variety of contexts. As a preliminary to the discussion of 
these problems in the following chapters, we present some results from linear 
algebra. The first section contains a review of material on vector spaces, matrices, 
and linear systems, which is taught in most undergraduate linear algebra courses. 
These results are summarized only, and no derivations are included. The remain- 
ing sections discuss eigenvalues, canonical forms for matrices, vector and matrix 
norms, and perturbation theorems for matrix inverses. If necessary, this chapter 
can be skipped, and the results can be referred back to as they are needed in 
Chapters 8 and 9. For notation, Section 7.1 and the norm notation of Section 7.3 
should be skimmed. 


7.1 Vector Spaces, Matrices, and Linear Systems 


Roughly speaking a vector space V is a set of objects, called vectors, for which 
operations of vector addition and scalar multiplication have been defined. A 
vector space V has a set of scalars associated with it, and in this text, this set can 
be either the real numbers R or complex numbers C. The vector operations must 
satisfy certain standard associative, commutative, and distributive rules, which 
we will not list. A subset W of a vector space V is called a subspace of V if W is 
a vector space using the vector operations inherited from V. For a complete 
development of the theory of vector spaces, see any undergraduate text on linear 
algebra [for example, Anton (1984), chap. 3; Halmos (1958), chap. 1; Noble
(1969), chaps. 4 and 14; Strang (1980), chap. 2].


Example 1. $V = R^n$, the set of all n-tuples $(x_1, \dots, x_n)$ with real entries $x_i$, and
R is the associated set of scalars.


2. $V = C^n$, the set of all n-tuples with complex entries, and C is the set of
scalars.


3. V = the set of all polynomials of degree $\le n$, for some given n, is a vector
space. The scalars can be R or C, as desired for the application.




4. $V = C[a, b]$, the set of all continuous real valued [or complex valued]
functions on the interval [a, b], is a vector space with scalar set equal to R [or C].
The example in (3) is a subspace of C[a, b].


Definition Let V be a vector space and let $v_1, v_2, \dots, v_m \in V$.


1. We say that $v_1, \dots, v_m$ are linearly dependent if there is a set of
scalars $\alpha_1, \dots, \alpha_m$, with at least one nonzero scalar, for which

$$\alpha_1 v_1 + \cdots + \alpha_m v_m = 0$$

Since at least one scalar is nonzero, say $\alpha_i \ne 0$, we can solve for

$$v_i = -\frac{\alpha_1}{\alpha_i} v_1 - \cdots - \frac{\alpha_{i-1}}{\alpha_i} v_{i-1} - \frac{\alpha_{i+1}}{\alpha_i} v_{i+1} - \cdots - \frac{\alpha_m}{\alpha_i} v_m$$

We say that $v_i$ is a linear combination of the vectors
$v_1, \dots, v_{i-1}, v_{i+1}, \dots, v_m$. For a set of vectors to be linearly
dependent, one of them must be a linear combination of the
remaining ones.


2. We say $v_1, \dots, v_m$ are linearly independent if they are not dependent.
Equivalently, the only choice of scalars $\alpha_1, \dots, \alpha_m$ for
which

$$\alpha_1 v_1 + \cdots + \alpha_m v_m = 0$$

is the trivial choice $\alpha_1 = \cdots = \alpha_m = 0$. No $v_i$ can be written as
a combination of the remaining ones.


3. $\{v_1, \dots, v_m\}$ is a basis for V if for every $v \in V$, there is a
unique choice of scalars $\alpha_1, \dots, \alpha_m$ for which

$$v = \alpha_1 v_1 + \cdots + \alpha_m v_m$$

Note that this implies $v_1, \dots, v_m$ are independent. If such a
finite basis exists, we say V is finite dimensional. Otherwise, it is
called infinite dimensional.


Theorem 7.1 If V is a vector space with a basis $\{v_1, \dots, v_m\}$, then every basis
for V will contain exactly m vectors. The number m is called the
dimension of V.


Example 1. $\{1, x, x^2, \dots, x^n\}$ is a basis for the space V of polynomials of
degree $\le n$. Thus dimension V = n + 1.


2. $R^n$ and $C^n$ have the basis $\{e_1, \dots, e_n\}$, in which

$$e_i = (0, 0, \dots, 0, 1, 0, \dots, 0) \tag{7.1.1}$$

with the 1 in position i. Dimension $R^n$, $C^n$ = n. This is called the standard basis
for $R^n$ and $C^n$, and the vectors in it are called unit vectors.


3. C[a, b] is infinite dimensional. 


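Matrices and linear systems Matrices are rectangular arrays of real or complex
numbers, and the general matrix of order $m \times n$ has the form

$$A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \tag{7.1.2}$$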


A matrix of order n is shorthand for a square matrix of order n X n. Matrices 
will be denoted by capital letters, and their entries will normally be denoted by 
lowercase letters, usually corresponding to the name of the matrix, as just given. 
The following definitions give the common operations on matrices. 


Definition 1. Let A and B have order $m \times n$. The sum of A and B is the
matrix C = A + B, of order $m \times n$, given by

$$c_{ij} = a_{ij} + b_{ij}$$

2. Let A have order $m \times n$, and let $\alpha$ be a scalar. Then the scalar
multiple $C = \alpha A$ is of order $m \times n$ and is given by

$$c_{ij} = \alpha a_{ij}$$

3. Let A have order $m \times n$ and B have order $n \times p$. Then the
product C = AB is of order $m \times p$, and it is given by

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$

4. Let A have order $m \times n$. The transpose $C = A^T$ has order $n \times m$,
and is given by

$$c_{ij} = a_{ji}$$

The conjugate transpose $C = A^*$ also has order $n \times m$, and

$$c_{ij} = \bar{a}_{ji}$$

The notation $\bar{z}$ denotes the complex conjugate of the complex number
z, and z is real if and only if $\bar{z} = z$. The conjugate transpose $A^*$ is also
called the adjoint of A.

The following arithmetic properties of matrices can be shown without much
difficulty, and they are left to the reader.


(a) $A + B = B + A$          (b) $(A + B) + C = A + (B + C)$
(c) $A(B + C) = AB + AC$     (d) $A(BC) = (AB)C$          (7.1.3)
(e) $(A + B)^T = A^T + B^T$  (f) $(AB)^T = B^T A^T$


It is important for many applications to note that the matrices need not be
square for the preceding properties to hold.
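As a small illustration of the remark that the matrices in (7.1.3) need not be square, the following sketch checks property (f) for a $2 \times 3$ matrix times a $3 \times 2$ matrix. The helper names `matmul` and `transpose` and the sample entries are assumptions made for this example only; they do not come from the text.

```python
def matmul(A, B):
    """Product of an m x n matrix A and an n x p matrix B, stored as lists of rows."""
    n = len(B)
    assert all(len(row) == n for row in A), "inner dimensions must agree"
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    """Transpose: entry (i, j) of the result is entry (j, i) of A."""
    return [list(col) for col in zip(*A)]


# Hypothetical rectangular sample matrices.
A = [[1, 2, 3],
     [4, 5, 6]]      # order 2 x 3
B = [[1, 0],
     [2, -1],
     [0, 3]]         # order 3 x 2

AB = matmul(A, B)                          # order 2 x 2
lhs = transpose(AB)                        # (AB)^T
rhs = matmul(transpose(B), transpose(A))   # B^T A^T
print(lhs == rhs)
```

Note that $A^T B^T$ is also defined here, but it has order $3 \times 3$, so the order of the factors in (f) matters.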
The vector spaces $R^n$ and $C^n$ will usually be identified with the set of column
vectors of order $n \times 1$, with real and complex entries, respectively. The linear
system

$$\begin{aligned} a_{11}x_1 + \cdots + a_{1n}x_n &= b_1 \\ &\;\;\vdots \\ a_{m1}x_1 + \cdots + a_{mn}x_n &= b_m \end{aligned} \tag{7.1.4}$$

can be written as Ax = b, with A as in (7.1.2), and

$$x = [x_1, \dots, x_n]^T \qquad b = [b_1, \dots, b_m]^T$$

The vector b is a given vector in $R^m$, and the solution x is an unknown vector in
$R^n$. The use of matrix multiplication reduces the linear system (7.1.4) to the
simpler and more intuitive form Ax = b.

We now introduce a few additional definitions for matrices, including some 
special matrices. 


Definition 1. The zero matrix of order $m \times n$ has all entries equal to zero. It is
denoted by $0_{m \times n}$, or more simply, by 0. For any matrix A of order
$m \times n$,

$$A + 0 = 0 + A = A$$

2. The identity matrix of order n is defined by $I = [\delta_{ij}]$,

$$\delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases} \tag{7.1.5}$$

for all $1 \le i, j \le n$. For all matrices A of order $m \times n$ and B of order
$n \times p$,

$$AI = A \qquad IB = B$$

The notation $\delta_{ij}$ denotes the Kronecker delta function.


3. Let A be a square matrix of order n. If there is a square matrix B
of order n for which AB = BA = I, then we say A is invertible, with
inverse B. The matrix B can be shown to be unique, and we denote the
inverse of A by $A^{-1}$.

4. A matrix A is called symmetric if $A^T = A$, and it is called
Hermitian if $A^* = A$. The term symmetric is generally used only with
real matrices. The matrix A is skew-symmetric if $A^T = -A$. Of
necessity, all matrices that are symmetric, Hermitian, or skew-symmetric
must also be square.


5. Let A be an $m \times n$ matrix. The row rank of A is the number of
linearly independent rows in A, regarded as elements of $R^n$ or $C^n$, and
the column rank is the number of linearly independent columns. It can
be shown (Problem 4) that these two numbers are always equal, and
this is called the rank of A.

For the definition and properties of the determinant of a square matrix A, see 
any linear algebra text [for example, Anton (1984), chap. 2; Noble (1969), chap. 
7; and Strang (1980), chap. 4]. We summarize many of the results on matrix 
inverses and the solvability of linear systems in the following theorem. 

Theorem 7.2 Let A be a square matrix with elements from R (or C), and let the
vector space be $V = R^n$ (or $C^n$). Then the following are equivalent
statements.

1. Ax = b has a unique solution $x \in V$ for every $b \in V$.

2. Ax = b has a solution $x \in V$ for every $b \in V$.

3. Ax = 0 implies x = 0.

4. $A^{-1}$ exists.

5. Determinant (A) $\ne 0$.

6. Rank (A) = n.


Although no proof is given here, it is an excellent exercise to prove the
equivalence of some of these statements. Use the concepts of linear independence
and basis, along with Theorem 7.1. Also, use the decomposition

$$Ax = x_1 A_{*1} + \cdots + x_n A_{*n} \qquad x \in R^n \text{ or } C^n \tag{7.1.6}$$

with $A_{*j}$ denoting column j in A. This says that the space of all vectors of the
form Ax is spanned by the columns of A, although they may be linearly
dependent.
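The decomposition (7.1.6) can be checked numerically. The sketch below forms Ax once by the usual row-times-vector rule and once as a weighted sum of the columns of A. The function names and the sample matrix and vector are assumptions chosen for illustration.

```python
def matvec(A, x):
    """Ax computed row by row: entry i is the sum over j of a_ij * x_j."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def column(A, j):
    """Column j of A, returned as a list."""
    return [row[j] for row in A]


def column_combination(A, x):
    """Ax computed as x_1 * (column 1) + ... + x_n * (column n), as in (7.1.6)."""
    result = [0] * len(A)
    for j, xj in enumerate(x):
        result = [r + xj * c for r, c in zip(result, column(A, j))]
    return result


# Hypothetical sample data.
A = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 2]]
x = [1, -2, 3]

by_rows = matvec(A, x)
by_columns = column_combination(A, x)
print(by_rows, by_columns)
```

Both computations give the same vector, which is the sense in which every vector of the form Ax lies in the span of the columns of A.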


Inner product vector spaces One of the important reasons for reformulating 
problems as equivalent linear algebra problems is to introduce some geometric 
insight. Important to this process are the concepts of inner product and orthogo- 
nality. 




Definition 1. The inner product of two vectors $x, y \in R^n$ is defined by

$$(x, y) = \sum_{i=1}^{n} x_i y_i = x^T y = y^T x$$

and for vectors $x, y \in C^n$, define the inner product by

$$(x, y) = \sum_{i=1}^{n} x_i \bar{y}_i = y^* x$$


2. The Euclidean norm of x in $C^n$ or $R^n$ is defined by

$$\|x\|_2 = \sqrt{(x, x)} = \sqrt{|x_1|^2 + \cdots + |x_n|^2} \tag{7.1.7}$$


The following results are fairly straightforward to prove, and they are left to
the reader. Let V denote $C^n$ or $R^n$.


1. For all $x, y, z \in V$,

$$(x, y + z) = (x, y) + (x, z) \qquad (x + y, z) = (x, z) + (y, z)$$

2. For all $x, y \in V$,

$$(\alpha x, y) = \alpha (x, y)$$

and for $V = C^n$, $\alpha \in C$,

$$(x, \alpha y) = \bar{\alpha} (x, y)$$

3. In $C^n$, $(x, y) = \overline{(y, x)}$; and in $R^n$, $(x, y) = (y, x)$.

4. For all $x \in V$,

$$(x, x) \ge 0$$

and (x, x) = 0 if and only if x = 0.

5. For all $x, y \in V$,

$$|(x, y)|^2 \le (x, x)(y, y) \tag{7.1.8}$$

This is called the Cauchy-Schwarz inequality, and it is proved in exactly the
same manner as (4.4.3) in Chapter 4. Using the Euclidean norm, we can
write it as

$$|(x, y)| \le \|x\|_2 \|y\|_2 \tag{7.1.9}$$

6. For all $x, y \in V$,

$$\|x + y\|_2 \le \|x\|_2 + \|y\|_2 \tag{7.1.10}$$

This is the triangle inequality. For a geometric interpretation, see the earlier
comments in Section 4.1 of Chapter 4 for the norm $\|f\|_\infty$ on C[a, b]. For a
proof of (7.1.10), see the derivation of (4.4.4) in Chapter 4.

7. For any square matrix A of order n, and for any $x, y \in C^n$,

$$(Ax, y) = (x, A^* y) \tag{7.1.11}$$
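The properties (7.1.9), (7.1.10), and (7.1.11) can be spot-checked for complex vectors with a few lines of code. This is only a numerical sanity check under assumed sample data, not a proof; the helper names are hypothetical, and the inner product follows the convention of the text, with the conjugate on the second argument.

```python
import math


def inner(x, y):
    """(x, y) = sum of x_i * conj(y_i), the convention used in the text."""
    return sum(xi * complex(yi).conjugate() for xi, yi in zip(x, y))


def norm2(x):
    """Euclidean norm, the square root of (x, x)."""
    return math.sqrt(inner(x, x).real)


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def conj_transpose(A):
    """A*: entry (j, i) of the result is the conjugate of entry (i, j) of A."""
    return [[complex(A[i][j]).conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]


# Hypothetical sample data in C^3.
x = [1 + 2j, -1j, 3]
y = [2, 1 - 1j, 1j]
A = [[1, 2j, 0],
     [1 - 1j, 3, 1],
     [0, 1j, 2]]

# Each gap should be nonnegative (first two) or zero up to rounding (third).
cauchy_schwarz_gap = norm2(x) * norm2(y) - abs(inner(x, y))
triangle_gap = norm2(x) + norm2(y) - norm2([a + b for a, b in zip(x, y)])
adjoint_gap = abs(inner(matvec(A, x), y) - inner(x, matvec(conj_transpose(A), y)))
print(cauchy_schwarz_gap, triangle_gap, adjoint_gap)
```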
The inner product was used to introduce the Euclidean length, but it is also
used to define a sense of angle, at least in spaces in which the scalar set is R.


Definition 1. For x, y in $R^n$, the angle between x and y is defined by

$$\mathcal{A}(x, y) = \cos^{-1}\left[\frac{(x, y)}{\|x\|_2 \|y\|_2}\right]$$

Note that the argument is between $-1$ and 1, due to the
Cauchy-Schwarz inequality (7.1.9). The preceding definition can be
written implicitly as

$$(x, y) = \|x\|_2 \|y\|_2 \cos(\mathcal{A}(x, y)) \tag{7.1.12}$$

a familiar formula from the use of the dot product in $R^2$ and $R^3$.


2. Two vectors x and y are orthogonal if and only if (x, y) = 0.
This is motivated by (7.1.12). If $\{x^{(1)}, \dots, x^{(n)}\}$ is a basis for $C^n$ or $R^n$,
and if $(x^{(i)}, x^{(j)}) = 0$ for all $i \ne j$, $1 \le i, j \le n$, then we say
$\{x^{(1)}, \dots, x^{(n)}\}$ is an orthogonal basis. If all basis vectors have
Euclidean length 1, the basis is called orthonormal.

3. A square matrix U is called unitary if

$$U^*U = UU^* = I$$

If the matrix U is real, it is usually called orthogonal, rather than
unitary. The rows [or columns] of an order n unitary matrix form an
orthonormal basis for $C^n$, and similarly for orthogonal matrices
and $R^n$.

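Example 1. The angle between the vectors

$$x = (1, 2, 3) \qquad y = (3, 2, 1)$$

is given by

$$\mathcal{A}(x, y) = \cos^{-1}\left[\frac{10}{14}\right] \doteq .775 \text{ radians}$$

2. The matrices

$$U_1 = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$$

$U_2 =$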

S|~ I~ 
S|] ~ 


are unitary, with the first being orthogonal. 
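The angle in part 1 of the example, and the orthogonality of the rotation matrix $U_1$, can be reproduced with a short computation. The function names below and the test angle $\theta = 0.3$ are assumptions made for this sketch.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm2(x):
    return math.sqrt(dot(x, x))


def angle(x, y):
    """Angle between real vectors x and y; the argument of acos lies in [-1, 1]
    by the Cauchy-Schwarz inequality (rounding aside)."""
    return math.acos(dot(x, y) / (norm2(x) * norm2(y)))


def rotation(theta):
    """The real 2 x 2 matrix [[cos, sin], [-sin, cos]]."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]


def orthogonality_defect(U):
    """Largest entry of |U^T U - I|; zero (up to rounding) for an orthogonal U."""
    n = len(U)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            entry = sum(U[k][i] * U[k][j] for k in range(n))
            worst = max(worst, abs(entry - (1.0 if i == j else 0.0)))
    return worst


theta_xy = angle([1, 2, 3], [3, 2, 1])     # acos(10/14), about .775 radians
defect = orthogonality_defect(rotation(0.3))
print(round(theta_xy, 3), defect)
```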
Figure 7.1 Illustration of (7.1.15).


An orthonormal basis for a vector space $V = R^n$ or $C^n$ is desirable, since it is
then easy to decompose an arbitrary vector into its components in the direction
of the basis vectors. More precisely, let $\{u^{(1)}, \dots, u^{(n)}\}$ be an orthonormal basis
for V, and let $x \in V$. Using the basis,

$$x = \alpha_1 u^{(1)} + \cdots + \alpha_n u^{(n)}$$

for some unique choice of coefficients $\alpha_1, \dots, \alpha_n$. To find $\alpha_j$, form the inner
product of x with $u^{(j)}$, and then

$$(x, u^{(j)}) = \alpha_1 (u^{(1)}, u^{(j)}) + \cdots + \alpha_n (u^{(n)}, u^{(j)}) = \alpha_j \tag{7.1.13}$$

using the orthonormality properties of the basis. Thus

$$x = \sum_{j=1}^{n} (x, u^{(j)})\, u^{(j)} \tag{7.1.14}$$

This can be given a geometric interpretation, which is shown in Figure 7.1. Using
(7.1.13)

$$\alpha_j = (x, u^{(j)}) = \|x\|_2 \|u^{(j)}\|_2 \cos(\mathcal{A}(x, u^{(j)})) = \|x\|_2 \cos(\mathcal{A}(x, u^{(j)})) \tag{7.1.15}$$

Thus the coefficient $\alpha_j$ is just the length of the orthogonal projection of x onto
the axis determined by $u^{(j)}$. The formula (7.1.14) is a generalization of the
decomposition of a vector x using the standard basis $\{e^{(1)}, \dots, e^{(n)}\}$, defined
earlier.
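A minimal sketch of the expansion (7.1.14) for real vectors follows. The orthonormal basis of $R^2$ used here, a rotation of the standard basis by 60 degrees, is an assumed choice for illustration, as are the function names.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def coefficients(x, basis):
    """alpha_j = (x, u^(j)) for each vector u^(j) of an orthonormal basis."""
    return [dot(x, u) for u in basis]


def reconstruct(alphas, basis):
    """Sum of alpha_j * u^(j), which should reproduce x."""
    n = len(basis[0])
    return [sum(a * u[i] for a, u in zip(alphas, basis)) for i in range(n)]


# Assumed orthonormal basis of R^2.
u1 = (0.5, math.sqrt(3) / 2)
u2 = (-math.sqrt(3) / 2, 0.5)
basis = [u1, u2]

x = (1.0, 0.0)
alphas = coefficients(x, basis)     # projections of x onto u1 and u2
x_again = reconstruct(alphas, basis)
print(alphas, x_again)
```

Because the basis is orthonormal, no linear system has to be solved to find the coefficients; each one is a single inner product.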


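Example Let $V = R^2$, and consider the orthonormal basis

$$u^{(1)} = \left(\frac{1}{2}, \frac{\sqrt{3}}{2}\right) \qquad u^{(2)} = \left(-\frac{\sqrt{3}}{2}, \frac{1}{2}\right)$$

Then for a given vector $x = (x_1, x_2)$, it can be written as

$$x = \alpha_1 u^{(1)} + \alpha_2 u^{(2)}$$

$$\alpha_1 = (x, u^{(1)}) = \frac{x_1 + x_2\sqrt{3}}{2} \qquad \alpha_2 = (x, u^{(2)}) = \frac{x_2 - x_1\sqrt{3}}{2}$$

For example,

$$(1, 0) = \frac{1}{2} u^{(1)} - \frac{\sqrt{3}}{2} u^{(2)}$$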

7.2 Eigenvalues and Canonical Forms for Matrices 


The number $\lambda$, complex or real, is an eigenvalue of the square matrix A if there is
a vector $x \in C^n$, $x \ne 0$, such that

$$Ax = \lambda x \tag{7.2.1}$$

The vector x is called an eigenvector corresponding to the eigenvalue $\lambda$. From
Theorem 7.2, statements (3) and (5), $\lambda$ is an eigenvalue of A if and only if

$$\det(A - \lambda I) = 0 \tag{7.2.2}$$

This is called the characteristic equation for A, and to analyze it we introduce the
function

$$f_A(\lambda) = \det(A - \lambda I)$$

If A has order n, then $f_A(\lambda)$ will be a polynomial of degree exactly n, called the
characteristic polynomial of A. To prove it is a polynomial, expand the determinant
by minors repeatedly to get

$$\begin{aligned} f_A(\lambda) &= \det \begin{bmatrix} a_{11} - \lambda & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} - \lambda & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{n1} & \cdots & & a_{nn} - \lambda \end{bmatrix} \\ &= (a_{11} - \lambda)(a_{22} - \lambda) \cdots (a_{nn} - \lambda) + \text{terms of degree} \le n - 2 \\ &= (-1)^n \lambda^n + (-1)^{n-1}(a_{11} + \cdots + a_{nn})\lambda^{n-1} + \text{terms of degree} \le n - 2 \end{aligned} \tag{7.2.3}$$

Also note that the constant term is

$$f_A(0) = \det(A) \tag{7.2.4}$$

From the coefficient of $\lambda^{n-1}$, define

$$\text{trace}(A) = a_{11} + a_{22} + \cdots + a_{nn} \tag{7.2.5}$$

which is often a quantity of interest in the study of A.

Since $f_A(\lambda)$ is of degree n, there are exactly n eigenvalues for A, if we count
multiple roots according to their multiplicity. Every matrix has at least one
eigenvalue-eigenvector pair, and the $n \times n$ matrix A has at most n distinct
eigenvalues.

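Example 1. The characteristic polynomial for

$$A = \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 2 \end{bmatrix}$$

is

$$f_A(\lambda) = -\lambda^3 + 7\lambda^2 - 14\lambda + 8$$

The eigenvalues are $\lambda_1 = 1$, $\lambda_2 = 2$, $\lambda_3 = 4$, and the corresponding eigenvectors
are

$$u^{(1)} = \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix} \qquad u^{(2)} = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \qquad u^{(3)} = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}$$

Note that these eigenvectors are orthogonal to each other, and therefore they are
linearly independent. Since the dimension of $R^3$ (and $C^3$) is three, these eigenvectors
form an orthogonal basis for $R^3$ (and $C^3$). This illustrates Theorem 7.4,
which is presented later in the section.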
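The eigenvalues and eigenvectors quoted in this example can be verified directly from the definition $Ax = \lambda x$. The sketch below uses integer arithmetic, so the checks are exact; the helper names are assumptions for illustration.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def char_poly(lam):
    """The characteristic polynomial stated in the example."""
    return -lam**3 + 7 * lam**2 - 14 * lam + 8


A = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 2]]

# (eigenvalue, eigenvector) pairs taken from the example.
pairs = [(1, [1, -1, 1]),
         (2, [1, 0, -1]),
         (4, [1, 2, 1])]

# Residual A u - lambda u for each pair; every entry should be zero.
residuals = [[au - lam * ui for au, ui in zip(matvec(A, u), u)] for lam, u in pairs]

# Inner products between distinct eigenvectors; each should be zero.
cross_products = [dot(pairs[i][1], pairs[j][1])
                  for i in range(3) for j in range(i + 1, 3)]
print(residuals, cross_products, [char_poly(lam) for lam, _ in pairs])
```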


2. For the matrix

$$A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad f_A(\lambda) = (1 - \lambda)^3$$

and there are three linearly independent eigenvectors for the eigenvalue $\lambda = 1$,
for example,

$$[1, 0, 0]^T \qquad [0, 1, 0]^T \qquad [0, 0, 1]^T$$

All other eigenvectors are linear combinations of these three vectors.

3. For the matrix

$$A = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix} \qquad f_A(\lambda) = (1 - \lambda)^3$$

The matrix A has only one linearly independent eigenvector for the eigenvalue
$\lambda = 1$, namely

$$x = [1, 0, 0]^T$$

and multiples of it.


The algebraic multiplicity of an eigenvalue of a matrix A is its multiplicity as a
root of $f_A(\lambda)$, and its geometric multiplicity is the maximum number of linearly
independent eigenvectors associated with the eigenvalue. The sum of the algebraic
multiplicities of the eigenvalues of an $n \times n$ matrix A is constant with
respect to small perturbations in A, namely n. But the sum of the geometric
multiplicities can vary greatly with small perturbations, and this causes the
numerical calculation of eigenvectors to often be a very difficult problem. Also,
the algebraic and geometric multiplicities need not be equal, as the preceding
examples show.
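One way to compute a geometric multiplicity, sketched below, is to use the fact that the eigenvectors for $\lambda$ are the nonzero solutions of $(A - \lambda I)x = 0$, so their number of independent directions is n minus the rank of $A - \lambda I$. The rank routine here is a plain Gaussian elimination in exact rational arithmetic, written for this illustration; it is not a method prescribed by the text.

```python
from fractions import Fraction


def rank(M):
    """Rank by Gaussian elimination, using exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in M]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0                                   # number of pivots found so far
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue                        # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r


def geometric_multiplicity(A, lam):
    """n - rank(A - lam I): the dimension of the eigenspace for lam."""
    n = len(A)
    shifted = [[A[i][j] - (lam if i == j else 0) for j in range(n)] for i in range(n)]
    return n - rank(shifted)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # Example 2
jordan = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]     # Example 3

g_identity = geometric_multiplicity(identity, 1)
g_jordan = geometric_multiplicity(jordan, 1)
print(g_identity, g_jordan)
```

Both matrices have the eigenvalue 1 with algebraic multiplicity 3, but the geometric multiplicities differ, which matches Examples 2 and 3.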


Definition Let A and B be square matrices of the same order. Then A is similar
to B if there is a nonsingular matrix P for which

$$B = P^{-1}AP \tag{7.2.6}$$

Note that this is a symmetric relation since

$$A = Q^{-1}BQ \qquad Q = P^{-1}$$

The relation (7.2.6) can be interpreted to say that A and B are matrix representations
of the same linear transformation T from V to V [$V = R^n$ or $C^n$], but with
respect to different bases for V. The matrix P is called the change of basis matrix,
and it relates the two representations of a vector $x \in V$ with respect to the two
bases being used [see Anton (1984), sec. 5.5 or Noble (1969), sec. 14.5 for greater
detail].


We now present a few simple properties about similar matrices and their 
eigenvalues. 


1. If A and B are similar, then $f_A(\lambda) = f_B(\lambda)$. To prove this, use (7.2.6) to
show

$$f_B(\lambda) = \det(B - \lambda I) = \det[P^{-1}(A - \lambda I)P] = \det(P^{-1}) \det(A - \lambda I) \det(P) = f_A(\lambda)$$

since

$$\det(P) \det(P^{-1}) = \det(PP^{-1}) = \det(I) = 1$$

2. The eigenvalues of similar matrices A and B are exactly the same, and there
is a one-to-one correspondence of the eigenvectors. If $Ax = \lambda x$, then using
(7.2.6),

$$P^{-1}AP(P^{-1}x) = \lambda P^{-1}x$$
$$Bz = \lambda z \qquad z = P^{-1}x \tag{7.2.7}$$

Trivially, $z \ne 0$, since otherwise x would be zero. Also, given any eigenvector
z of B, this argument can be reversed to produce a corresponding
eigenvector x = Pz for A.

3. Since $f_A(\lambda)$ is invariant under similarity transformations of A, the coefficients
of $f_A(\lambda)$ are also invariant under such similarity transformations. In
particular, for A similar to B,

$$\text{trace}(A) = \text{trace}(B) \qquad \det(A) = \det(B) \tag{7.2.8}$$
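The invariance (7.2.8) is easy to observe on a small case. In the sketch below, the matrices A and P are hypothetical $2 \times 2$ examples, and exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction


def matmul(A, B):
    n = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(len(B[0]))]
            for i in range(len(A))]


def inverse_2x2(P):
    """Inverse of a nonsingular 2 x 2 integer matrix via the adjugate formula."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("P must be nonsingular")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


def det_2x2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


# Hypothetical sample matrices.
A = [[2, 1],
     [1, 3]]
P = [[1, 2],
     [3, 5]]            # det(P) = -1, so P is nonsingular

B = matmul(matmul(inverse_2x2(P), A), P)    # B = P^{-1} A P
print(B, trace(A), trace(B), det_2x2(A), det_2x2(B))
```

The entries of B differ from those of A, yet the trace and the determinant agree, as (7.2.8) asserts.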


Canonical forms We now present several important canonical forms for 
matrices. These forms relate the structure of a matrix to its eigenvalues and 
eigenvectors, and they are used in a variety of applications in other areas of 
mathematics and science. 


Theorem 7.3 (Schur Normal Form) Let A have order n with elements from C.
Then there exists a unitary matrix U such that

$$T = U^*AU \tag{7.2.9}$$

is upper triangular.
Since T is triangular, and since $U^* = U^{-1}$,

$$f_A(\lambda) = f_T(\lambda) = (t_{11} - \lambda) \cdots (t_{nn} - \lambda) \tag{7.2.10}$$

and thus the eigenvalues of A are the diagonal elements of T.


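Proof The proof is by induction on the order n of A. The result is trivially true
for n = 1, using U = [1]. We assume the result is true for all matrices of
order $n \le k - 1$, and we will then prove it has to be true for all matrices
of order n = k.

Let $\lambda_1$ be an eigenvalue of A, and let $u^{(1)}$ be an associated eigenvector
with $\|u^{(1)}\|_2 = 1$. Beginning with $u^{(1)}$, pick an orthonormal basis for $C^k$,
calling it $\{u^{(1)}, \dots, u^{(k)}\}$. Define the matrix $P_1$ by

$$P_1 = [u^{(1)}, u^{(2)}, \dots, u^{(k)}]$$

which is written in partitioned form, with columns $u^{(1)}, \dots, u^{(k)}$ that are
orthogonal. Then $P_1^*P_1 = I$, and thus $P_1^{-1} = P_1^*$. Define

$$B_1 = P_1^*AP_1$$

Claim:

$$B_1 = \begin{bmatrix} \lambda_1 & \alpha_2 & \cdots & \alpha_k \\ 0 & & & \\ \vdots & & A_2 & \\ 0 & & & \end{bmatrix}$$

with $A_2$ of order k - 1 and $\alpha_2, \dots, \alpha_k$ some numbers. To prove this,
multiply using partitioned matrices:

$$AP_1 = A[u^{(1)}, \dots, u^{(k)}] = [Au^{(1)}, \dots, Au^{(k)}] = [\lambda_1 u^{(1)}, v^{(2)}, \dots, v^{(k)}] \qquad v^{(j)} = Au^{(j)}$$
$$B_1 = P_1^*AP_1 = [\lambda_1 P_1^*u^{(1)}, P_1^*v^{(2)}, \dots, P_1^*v^{(k)}]$$

Since $P_1^*P_1 = I$, it follows that $P_1^*u^{(1)} = e^{(1)} = [1, 0, \dots, 0]^T$. Thus

$$B_1 = [\lambda_1 e^{(1)}, w^{(2)}, \dots, w^{(k)}] \qquad w^{(j)} = P_1^*v^{(j)}$$

which has the desired form.
By the induction hypothesis, there exists a unitary matrix $\hat{P}_2$ of order
k - 1 for which

$$\hat{T} = \hat{P}_2^*A_2\hat{P}_2$$

is an upper triangular matrix of order k - 1. Define

$$P_2 = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & \hat{P}_2 & \\ 0 & & & \end{bmatrix}$$

Then $P_2$ is unitary, and

$$P_2^*B_1P_2 = \begin{bmatrix} \lambda_1 & \gamma_2 & \cdots & \gamma_k \\ 0 & & & \\ \vdots & & \hat{P}_2^*A_2\hat{P}_2 & \\ 0 & & & \end{bmatrix} = \begin{bmatrix} \lambda_1 & \gamma_2 & \cdots & \gamma_k \\ 0 & & & \\ \vdots & & \hat{T} & \\ 0 & & & \end{bmatrix} = T$$

an upper triangular matrix. Thus

$$T = P_2^*B_1P_2 = P_2^*P_1^*AP_1P_2 = (P_1P_2)^*A(P_1P_2)$$
$$T = U^*AU \qquad U = P_1P_2$$

and U is easily unitary. This completes the induction and the proof.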


Example For the matrix

$$A = \begin{bmatrix} .2 & .6 & 0 \\ 1.6 & -.2 & 0 \\ -1.6 & 1.2 & 3.0 \end{bmatrix}$$

the matrices of the theorem and (7.2.9) are

$$T = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 3 & 2 \\ 0 & 0 & -1 \end{bmatrix} \qquad U = \begin{bmatrix} .6 & 0 & -.8 \\ .8 & 0 & .6 \\ 0 & 1.0 & 0 \end{bmatrix}$$

This is not the usual way in which eigenvalues are calculated, but should be
considered only as an illustration of the theorem. The theorem is used generally
as a theoretical tool, rather than as a computational tool.


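Using (7.2.8) and (7.2.9),

$$\text{trace}(A) = \lambda_1 + \lambda_2 + \cdots + \lambda_n \qquad \det(A) = \lambda_1 \lambda_2 \cdots \lambda_n \tag{7.2.11}$$

where $\lambda_1, \dots, \lambda_n$ are the eigenvalues of A, which must form the diagonal
elements of T. As a much more important application, we have the following
well-known theorem.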


Theorem 7.4 (Principal Axes Theorem) Let A be a Hermitian matrix of order
n, that is, $A^* = A$. Then A has n real eigenvalues $\lambda_1, \dots, \lambda_n$, not
necessarily distinct, and n corresponding eigenvectors $u^{(1)}, \dots, u^{(n)}$
that form an orthonormal basis for $C^n$. If A is real, the eigenvectors
$u^{(1)}, \dots, u^{(n)}$ can be taken as real, and they form an orthonormal
basis of $R^n$. Finally there is a unitary matrix U for which

$$U^*AU = D = \text{diag}[\lambda_1, \dots, \lambda_n] \tag{7.2.12}$$

is a diagonal matrix with diagonal elements $\lambda_1, \dots, \lambda_n$. If A is also
real, then U can be taken as orthogonal.


Proof From Theorem 7.3, there is a unitary matrix U with

$$U^*AU = T$$

with T upper triangular. Form the conjugate transpose of both sides to
obtain

$$T^* = (U^*AU)^* = U^*A^*(U^*)^* = U^*AU = T$$

Since $T^*$ is lower triangular, we must have

$$T = \text{diag}[\lambda_1, \dots, \lambda_n]$$

Also, $T^* = T$ involves complex conjugation of all elements of T, and
thus all diagonal elements of T must be real.
Write U as

$$U = [u^{(1)}, \dots, u^{(n)}]$$

Then $T = U^*AU$ implies AU = UT,

$$A[u^{(1)}, \dots, u^{(n)}] = [u^{(1)}, \dots, u^{(n)}] \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}$$
$$[Au^{(1)}, \dots, Au^{(n)}] = [\lambda_1 u^{(1)}, \dots, \lambda_n u^{(n)}]$$

and

$$Au^{(j)} = \lambda_j u^{(j)} \qquad j = 1, \dots, n \tag{7.2.13}$$

Since the columns of U are orthonormal, and since the dimension of $C^n$
is n, these must form an orthonormal basis for $C^n$. We omit the proof of
the results that follow from A being real. This completes the proof.


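Example From an earlier example in this section, the matrix

$$A = \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 2 \end{bmatrix}$$

has the eigenvalues $\lambda_1 = 1$, $\lambda_2 = 2$, $\lambda_3 = 4$ and corresponding orthonormal
eigenvectors

$$u^{(1)} = \frac{1}{\sqrt{3}} \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix} \qquad u^{(2)} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \qquad u^{(3)} = \frac{1}{\sqrt{6}} \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}$$

These form an orthonormal basis for $R^3$ or $C^3$.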


There is a second canonical form that has recently become more important for
problems in numerical linear algebra, especially for solving overdetermined
systems of linear equations. These systems arise from the fitting of empirical data
using the linear least squares procedures [see Golub and Van Loan (1983), chap.
6, and Lawson and Hanson (1974)].


Theorem 7.5 (Singular Value Decomposition) Let A be order $n \times m$. Then
there are unitary matrices U and V, of orders m and n, respectively,
such that

$$V^*AU = F \tag{7.2.14}$$

is a "diagonal" rectangular matrix of order $n \times m$,

$$F = \begin{bmatrix} \text{diag}[\mu_1, \dots, \mu_r] & 0 \\ 0 & 0 \end{bmatrix} \tag{7.2.15}$$

The numbers $\mu_1, \dots, \mu_r$ are called the singular values of A. They are
all real and positive, and they can be arranged so that

$$\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > 0 \tag{7.2.16}$$

where r is the rank of the matrix A.

Proof Consider the square matrix $A^*A$ of order m. It is a Hermitian matrix,
and consequently Theorem 7.4 can be applied to it. The eigenvalues of
$A^*A$ are all real; moreover, they are all nonnegative. To see this, assume

$$A^*Ax = \lambda x \qquad x \ne 0$$

Then

$$(x, A^*Ax) = (x, \lambda x) = \lambda \|x\|_2^2$$
$$(x, A^*Ax) = (Ax, Ax) = \|Ax\|_2^2$$
$$\lambda = \left(\frac{\|Ax\|_2}{\|x\|_2}\right)^2 \ge 0$$

This result also proves that

$$Ax = 0 \quad \text{if and only if} \quad A^*Ax = 0 \qquad x \in C^m \tag{7.2.17}$$

From Theorem 7.4, there is an $m \times m$ unitary matrix U such that

$$U^*A^*AU = \text{diag}[\lambda_1, \dots, \lambda_r, 0, \dots, 0] \tag{7.2.18}$$

where all $\lambda_i \ne 0$, $1 \le i \le r$, and all are positive. Because $A^*A$ has order
m, the index $r \le m$. Introduce the singular values

$$\mu_j = \sqrt{\lambda_j} \qquad j = 1, \dots, r \tag{7.2.19}$$

The U can be chosen so that the ordering (7.2.16) is obtained. Using the
diagonal matrix

$$D = \text{diag}[\mu_1, \dots, \mu_r, 0, \dots, 0]$$

of order m, we can write (7.2.18) as

$$(AU)^*(AU) = D^2 \tag{7.2.20}$$

Let W = AU. Then (7.2.20) says $W^*W = D^2$. Writing W as

$$W = [W^{(1)}, \dots, W^{(m)}] \qquad W^{(j)} \in C^n$$

we have

$$(W^{(j)}, W^{(j)}) = \begin{cases} \mu_j^2 & 1 \le j \le r \\ 0 & j > r \end{cases} \tag{7.2.21}$$

and

$$(W^{(i)}, W^{(j)}) = 0 \qquad \text{if } i \ne j \tag{7.2.22}$$

From (7.2.21), $W^{(j)} = 0$ if $j > r$. And from (7.2.22), the first r columns
of W are orthogonal elements in $C^n$. Thus the first r columns are linearly
independent, and this implies $r \le n$.

Define

$$V^{(j)} = \frac{1}{\mu_j} W^{(j)} \qquad j = 1, \dots, r \tag{7.2.23}$$

This is an orthonormal set in $C^n$. If $r < n$, then choose $V^{(r+1)}, \dots, V^{(n)}$
so that $\{V^{(1)}, \dots, V^{(n)}\}$ is an orthonormal basis for $C^n$. Define

$$V = [V^{(1)}, \dots, V^{(n)}] \tag{7.2.24}$$

Easily V is an $n \times n$ unitary matrix, and it can be verified directly that
VF = W, with F as in (7.2.15). Thus

$$VF = AU$$

which proves (7.2.14). The proof that r = rank (A) and the derivation of
other properties of the singular value decomposition are left to Problem
19. The singular value decomposition is used in Chapter 9, in the least
squares solution of overdetermined linear systems.
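The first steps of the proof can be followed by hand for a real $2 \times 2$ matrix, where the eigenvalues of $A^TA$ come from the quadratic formula. The sketch below is only an illustration of (7.2.18) and (7.2.19) on an assumed sample matrix; it is not how singular values are computed in practice, and the function name is hypothetical.

```python
import math


def singular_values_2x2(A):
    """Singular values of a real 2 x 2 matrix as square roots of the
    eigenvalues of the symmetric matrix A^T A = [[p, q], [q, r]]."""
    (a, b), (c, d) = A
    p = a * a + c * c
    q = a * b + c * d
    r = b * b + d * d
    mean = (p + r) / 2
    radius = math.sqrt(((p - r) / 2) ** 2 + q * q)
    lam_large = mean + radius
    lam_small = max(mean - radius, 0.0)   # guard against tiny negative rounding
    return math.sqrt(lam_large), math.sqrt(lam_small)


# Hypothetical sample matrix; A^T A = [[25, 20], [20, 25]] has eigenvalues 45 and 5.
A = [[3.0, 0.0],
     [4.0, 5.0]]

mu1, mu2 = singular_values_2x2(A)
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
frobenius_sq = sum(v * v for row in A for v in row)
print(mu1, mu2, mu1 * mu2, abs(det_A), mu1**2 + mu2**2, frobenius_sq)
```

Two consequences of the decomposition are visible in the output for this square case: the product of the singular values equals $|\det(A)|$, and the sum of their squares equals the sum of the squares of the entries of A.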


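To give the most basic canonical form, introduce the following notation.
Define the $n \times n$ matrix

$$J_n(\lambda) = \begin{bmatrix} \lambda & 1 & 0 & \cdots & 0 \\ 0 & \lambda & 1 & & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ & & & \lambda & 1 \\ 0 & \cdots & & 0 & \lambda \end{bmatrix} \qquad n \ge 1 \tag{7.2.25}$$

where $J_n(\lambda)$ has the single eigenvalue $\lambda$, of algebraic multiplicity n and geometric
multiplicity 1. It is called a Jordan block.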


Theorem 7.6 (Jordan Canonical Form) Let A have order n. Then there is a
nonsingular matrix P for which

$$P^{-1}AP = \begin{bmatrix} J_{n_1}(\lambda_1) & & & 0 \\ & J_{n_2}(\lambda_2) & & \\ & & \ddots & \\ 0 & & & J_{n_r}(\lambda_r) \end{bmatrix} \tag{7.2.26}$$

The eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_r$ need not be distinct. For A Hermitian,
Theorem 7.4 implies we must have $n_1 = n_2 = \cdots = n_r = 1$,
for in that case the sum of the geometric multiplicities must be n,
the order of the matrix A.


It is often convenient to write (7.2.26) as

$$P^{-1}AP = D + N \qquad D = \text{diag}[\lambda_1, \dots, \lambda_r] \tag{7.2.27}$$

with each $\lambda_i$ appearing $n_i$ times on the diagonal of D. The matrix N has all zero
entries, except for possible 1s on the superdiagonal. It is a nilpotent matrix, and
more precisely, it satisfies

$$N^n = 0 \tag{7.2.28}$$

The Jordan form is not an easy theorem to prove, and the reader is referred to
any of the large number of linear algebra texts for a development of this rich
topic [e.g., see Franklin (1968), chap. 5; Halmos (1958), sec. 58; or Noble (1969),
chap. 11].
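The splitting (7.2.27) and the nilpotency (7.2.28) can be seen on a single Jordan block. The sketch below builds $J_4(2)$, separates it into its diagonal part D and superdiagonal part N, and finds the smallest power of N that vanishes. The block size, the eigenvalue 2, and the function names are assumptions for illustration.

```python
def matmul(A, B):
    n = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(len(B[0]))]
            for i in range(len(A))]


def jordan_block(lam, n):
    """J_n(lam): lam on the diagonal, 1 on the superdiagonal, 0 elsewhere."""
    return [[lam if i == j else (1 if j == i + 1 else 0) for j in range(n)]
            for i in range(n)]


def split_diagonal(J):
    """Return (D, N) with J = D + N, D the diagonal part of J."""
    n = len(J)
    D = [[J[i][j] if i == j else 0 for j in range(n)] for i in range(n)]
    N = [[J[i][j] - D[i][j] for j in range(n)] for i in range(n)]
    return D, N


def nilpotency_index(N):
    """Smallest k with N^k = 0; raises if no power up to the order of N vanishes."""
    n = len(N)
    power, k = N, 1
    while any(any(row) for row in power):
        if k == n:
            raise ValueError("matrix is not nilpotent")
        power = matmul(power, N)
        k += 1
    return k


J = jordan_block(2, 4)
D, N = split_diagonal(J)
index = nilpotency_index(N)
print(index)
```

For a single block of order n the index equals n, so (7.2.28) cannot be improved in general; for a matrix made of several smaller blocks, a lower power of N already vanishes.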


7.3 Vector and Matrix Norms 


The Euclidean norm $\|x\|_2$ has already been introduced, and it is the way in
which most people are used to measuring the size of a vector. But there are many
situations in which it is more convenient to measure the size of a vector in other
ways. Thus we introduce a general concept of the norm of a vector.

Definition Let V be a vector space, and let N(x) be a real valued function
defined on V. Then N(x) is a norm if:


(N1) $N(x) \ge 0$ for all $x \in V$, and N(x) = 0 if and only if x = 0.
(N2) $N(\alpha x) = |\alpha| N(x)$, for all $x \in V$ and all scalars $\alpha$.
(N3) $N(x + y) \le N(x) + N(y)$, for all $x, y \in V$.


The usual notation is $\|x\| = N(x)$. The notation N(x) is used to emphasize
that the norm is a function, with domain V and range the nonnegative real
numbers. Define the distance from x to y as $\|x - y\|$. Simple consequences are
the triangle inequality in its alternative form

$$\|x - z\| \le \|x - y\| + \|y - z\|$$

and the reverse triangle inequality,

$$\big|\, \|x\| - \|y\| \,\big| \le \|x - y\| \qquad x, y \in V \tag{7.3.1}$$


Example 1. For $1 \le p < \infty$, define the p-norm,

$$\|x\|_p = \left[ \sum_{j=1}^{n} |x_j|^p \right]^{1/p} \qquad x \in C^n \tag{7.3.2}$$

2. The maximum norm is

$$\|x\|_\infty = \max_{1 \le j \le n} |x_j| \qquad x \in C^n \tag{7.3.3}$$

The use of the subscript $\infty$ on the norm is motivated by the result in Problem 23.

3. For the vector space V = C[a, b], the function norms $\|f\|_2$ and $\|f\|_\infty$ were
introduced in Chapters 4 and 1, respectively.

Example Consider the vector x = (1, 0, -1, 2). Then

$$\|x\|_1 = 4 \qquad \|x\|_2 = \sqrt{6} \qquad \|x\|_\infty = 2$$
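
The three values in this example can be checked with a few lines of code. The following is only a minimal sketch in Python; the helper name `p_norm` is an illustrative choice and not part of the text, and the vector is taken to be x = (1, 0, -1, 2) as in the example.

```python
import math

def p_norm(x, p):
    """Return the p-norm of the sequence x; pass math.inf for the maximum norm."""
    if p == math.inf:
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

x = (1, 0, -1, 2)
norm_1 = p_norm(x, 1)            # sum of absolute values
norm_2 = p_norm(x, 2)            # Euclidean norm
norm_inf = p_norm(x, math.inf)   # largest absolute component
print(norm_1, norm_2, norm_inf)
```

Evaluating `p_norm(x, p)` for increasing p is also a quick way to watch the p-norm approach the maximum norm, which is the limit referred to in Problem 23.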


To show that $\|\cdot\|_p$ is a norm for a general p is nontrivial. The cases p = 1
and $\infty$ are straightforward, and $\|\cdot\|_2$ has been treated in Section 4.1. But for
$1 < p < \infty$, $p \ne 2$, it is difficult to show that $\|\cdot\|_p$ satisfies the triangle
inequality. This is not a significant problem for us since the main cases of interest
are $p = 1, 2, \infty$. To give some geometrical intuition for these norms, the unit
circles

$$S_p = \{ x \in R^2 \mid \|x\|_p = 1 \} \qquad p = 1, 2, \infty \tag{7.3.4}$$

are sketched in Figure 7.2.

Figure 7.2 The unit sphere $S_p$ using vector norm $\|\cdot\|_p$.


We now prove some results relating different norms. We begin with the
following result on the continuity of $N(x) = \|x\|$, as a function of x.

Lemma Let N(x) be a norm on $C^n$ (or $R^n$). Then N(x) is a continuous
function of the components $x_1, x_2, \ldots, x_n$ of x.

Proof We want to show that
$$x_j \to y_j \qquad j = 1, 2, \ldots, n$$
implies
$$N(x) \to N(y)$$
Using the reverse triangle inequality (7.3.1),
$$|N(x) - N(y)| \le N(x - y) \qquad x, y \in C^n$$

Recall from (7.1.1) the definition of the standard basis $\{e^{(1)}, \ldots, e^{(n)}\}$
for $C^n$. Then

$$x - y = \sum_{j=1}^{n} (x_j - y_j) e^{(j)}$$

$$N(x - y) \le \sum_{j=1}^{n} |x_j - y_j| \, N(e^{(j)}) \le \|x - y\|_\infty \sum_{j=1}^{n} N(e^{(j)})$$

$$|N(x) - N(y)| \le c \, \|x - y\|_\infty \qquad c = \sum_{j=1}^{n} N(e^{(j)}) \tag{7.3.5}$$

This completes the proof.

Note that it also proves that for every vector norm N on $C^n$, there is a c > 0
with
$$N(x) \le c \, \|x\|_\infty \qquad \text{all } x \in C^n \tag{7.3.6}$$

Just let y = 0 in (7.3.5). The following theorem proves the converse of this result.
Theorem 7.7 (Equivalence of Norms) Let N and M be norms on $V = C^n$ or
$R^n$. Then there are constants $c_1, c_2 > 0$ for which

$$c_1 M(x) \le N(x) \le c_2 M(x) \qquad \text{all } x \in V \tag{7.3.7}$$

Proof It is sufficient to consider the case in which N is arbitrary and $M(x) =
\|x\|_\infty$. Combining two such statements then leads to the general result.
Thus we wish to show there are constants $c_1, c_2$ for which

$$c_1 \|x\|_\infty \le N(x) \le c_2 \|x\|_\infty \qquad \text{all } x \in C^n \tag{7.3.8}$$

or equivalently,

$$c_1 \le N(z) \le c_2 \qquad \text{all } z \in S \tag{7.3.9}$$

in which S is the set of all points z in $C^n$ for which $\|z\|_\infty = 1$. The upper
inequality of (7.3.9) follows immediately from (7.3.6).

Note that S is a closed and bounded set in $C^n$, and N is a continuous
function on S. It is then a standard result of advanced calculus that N
attains its maximum and minimum on S at points of S, that is, there are
constants $c_1, c_2$ and points $z_1, z_2$ in S for which

$$c_1 = N(z_1) \le N(z) \le N(z_2) = c_2 \qquad \text{all } z \in S$$

Clearly, $c_1, c_2 \ge 0$. And if $c_1 = 0$, then $N(z_1) = 0$. But then $z_1 = 0$,
contrary to the construction of S that requires $\|z_1\|_\infty = 1$. This proves
(7.3.9), completing the proof of the theorem. Note: This theorem does
not generalize to infinite dimensional spaces.


Many numerical methods for problems involving linear systems produce a
sequence of vectors $\{x^{(m)} \mid m \ge 0\}$, and we want to speak of convergence of this
sequence to a vector x.

Definition A sequence of vectors $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}, \ldots\}$ in $C^n$ or $R^n$ is said
to converge to a vector x if and only if

$$\|x - x^{(m)}\| \to 0 \quad \text{as } m \to \infty$$

Note that the choice of norm is left unspecified. For finite
dimensional spaces, it doesn't matter which norm is used. Let M and
N be two norms on $C^n$. Then from (7.3.7),

$$c_1 M(x - x^{(m)}) \le N(x - x^{(m)}) \le c_2 M(x - x^{(m)}) \qquad m \ge 0$$

and $M(x - x^{(m)})$ converges to zero if and only if $N(x - x^{(m)})$ does
the same. Thus $x^{(m)} \to x$ with the M norm if and only if it
converges with the N norm. This is an important result, and it is not
true for infinite dimensional spaces.


Matrix norms The set of all $n \times n$ matrices with complex entries can be
considered as equivalent to the vector space $C^{n^2}$, with a special multiplicative
operation added onto the vector space. Thus a matrix norm should satisfy the
usual three requirements N1-N3 of a vector norm. In addition, we also require
two other conditions.

Definition A matrix norm satisfies N1-N3 and the following:

(N4) $\|AB\| \le \|A\| \, \|B\|$

(N5) Usually the vector space we will be working with, $V = C^n$ or
$R^n$, will have some vector norm, call it $\|x\|_v$, $x \in V$. We
require that the matrix and vector norms be compatible:

$$\|Ax\|_v \le \|A\| \, \|x\|_v \qquad \text{all } x \in V, \text{ all } A$$


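Example Let A be $n \times n$, $\|\cdot\|_v = \|\cdot\|_2$. Then for $x \in C^n$,

$$\|Ax\|_2 = \left[ \sum_{i=1}^{n} \Big| \sum_{j=1}^{n} a_{ij} x_j \Big|^2 \right]^{1/2} \le \left[ \sum_{i=1}^{n} \Big\{ \sum_{j=1}^{n} |a_{ij}|^2 \Big\} \Big\{ \sum_{j=1}^{n} |x_j|^2 \Big\} \right]^{1/2}$$

by using the Cauchy-Schwarz inequality (7.1.8). Then

$$\|Ax\|_2 \le F(A) \, \|x\|_2 \qquad F(A) = \left[ \sum_{i,j=1}^{n} |a_{ij}|^2 \right]^{1/2} \tag{7.3.10}$$

F(A) is called the Frobenius norm of A. Property N5 is shown using (7.3.10)
directly. Properties N1-N3 are satisfied since F(A) is just the Euclidean norm on
$C^{n^2}$. It remains to show N4. Using the Cauchy-Schwarz inequality,

$$F(AB) = \left[ \sum_{i,j=1}^{n} \Big| \sum_{k=1}^{n} a_{ik} b_{kj} \Big|^2 \right]^{1/2} \le \left[ \sum_{i,j=1}^{n} \Big\{ \sum_{k=1}^{n} |a_{ik}|^2 \Big\} \Big\{ \sum_{k=1}^{n} |b_{kj}|^2 \Big\} \right]^{1/2} = F(A) F(B)$$

Thus F(A) is a matrix norm, compatible with the Euclidean norm.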
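
Properties N4 and N5 for the Frobenius norm can be spot-checked numerically. The sketch below is illustrative only: the matrices A and B and the vector x are arbitrary choices, and the helper names (`matvec`, `matmul`, `frobenius`, `norm2`) are hypothetical, not taken from the text. A check on one example is of course not a proof.

```python
import math

def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def frobenius(A):
    """F(A): square root of the sum of squared magnitudes of all entries."""
    return math.sqrt(sum(abs(a) ** 2 for row in A for a in row))

def norm2(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(abs(t) ** 2 for t in x))

A = [[1.0, -2.0], [-3.0, 4.0]]
B = [[0.5, 1.0], [2.0, -1.0]]
x = [1.0, -1.0]

# N5: compatibility with the Euclidean vector norm, as in (7.3.10)
compatible = norm2(matvec(A, x)) <= frobenius(A) * norm2(x)
# N4: F(AB) <= F(A) F(B)
submultiplicative = frobenius(matmul(A, B)) <= frobenius(A) * frobenius(B)
print(compatible, submultiplicative)
```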


Usually when given a vector space with a norm $\|\cdot\|_v$, an associated matrix
norm is defined by

$$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|_v}{\|x\|_v} \tag{7.3.11}$$

It is often called the operator norm. By its definition, it satisfies N5:
$$\|Ax\|_v \le \|A\| \, \|x\|_v \qquad x \in C^n \tag{7.3.12}$$

For a matrix A, the operator norm induced by the vector norm $\|x\|_p$ will be
denoted by $\|A\|_p$. The most important cases are given in Table 7.1, and the
derivations are given later. We need the following definition in order to define
$\|A\|_2$.

Table 7.1 Vector norms and associated operator matrix norms

Vector Norm                                                        Matrix Norm
$\|x\|_1 = \sum_{j=1}^{n} |x_j|$                                   $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}|$
$\|x\|_2 = \left[ \sum_{j=1}^{n} |x_j|^2 \right]^{1/2}$            $\|A\|_2 = \sqrt{r_\sigma(A^*A)}$
$\|x\|_\infty = \max_{1 \le i \le n} |x_i|$                        $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|$


Definition Let A be an arbitrary matrix. The spectrum of A is the set of all
eigenvalues of A, and it is denoted by $\sigma(A)$. The spectral radius is
the maximum size of these eigenvalues, and it is denoted by

$$r_\sigma(A) = \max_{\lambda \in \sigma(A)} |\lambda| \tag{7.3.13}$$


To show (7.3.11) is a norm in general, we begin by showing it is finite. Recall
from Theorem 7.7 that there are constants $c_1, c_2 > 0$ with

$$c_1 \|x\|_2 \le \|x\|_v \le c_2 \|x\|_2 \qquad x \in C^n$$
Thus,
$$\frac{\|Ax\|_v}{\|x\|_v} \le \frac{c_2 \|Ax\|_2}{c_1 \|x\|_2} \le \frac{c_2}{c_1} F(A)$$

which proves $\|A\|$ is finite.
At this point it is interesting to note the geometric significance of $\|A\|$:

$$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|_v}{\|x\|_v} = \sup_{x \ne 0} \left\| A \left( \frac{x}{\|x\|_v} \right) \right\|_v = \sup_{\|z\|_v = 1} \|Az\|_v$$

By noting that the supremum doesn't change if we let $\|z\|_v \le 1$,

$$\|A\| = \sup_{\|z\|_v \le 1} \|Az\|_v \tag{7.3.14}$$

Let

$$B = \{ z \in C^n \mid \|z\|_v \le 1 \}$$

the unit ball with respect to $\|\cdot\|_v$. Then

$$\|A\| = \sup_{z \in B} \|Az\|_v = \sup_{w \in A(B)} \|w\|_v$$

with A(B) the image of B when A is applied to it. Thus $\|A\|$ measures the effect
of A on the unit ball, and if $\|A\| > 1$, then $\|A\|$ denotes the maximum stretching
of the ball B under the action of A.


Proof Following is a proof that the operator norm $\|A\|$ is a matrix norm.

1. Clearly $\|A\| \ge 0$, and if A = 0, then $\|A\| = 0$. Conversely, if $\|A\| = 0$,
then $\|Ax\|_v = 0$ for all x. Thus Ax = 0 for all x, and this implies
A = 0.

2. Let $\alpha$ be any scalar. Then

$$\|\alpha A\| = \sup_{\|x\|_v \le 1} \|\alpha Ax\|_v = \sup_{\|x\|_v \le 1} |\alpha| \, \|Ax\|_v = |\alpha| \sup_{\|x\|_v \le 1} \|Ax\|_v = |\alpha| \, \|A\|$$

3. For any $x \in C^n$,
$$\|(A + B)x\|_v = \|Ax + Bx\|_v \le \|Ax\|_v + \|Bx\|_v$$
since $\|\cdot\|_v$ is a norm. Using the property (7.3.12),
$$\|(A + B)x\|_v \le \|A\| \, \|x\|_v + \|B\| \, \|x\|_v$$
$$\frac{\|(A + B)x\|_v}{\|x\|_v} \le \|A\| + \|B\|$$
This implies
$$\|A + B\| \le \|A\| + \|B\|$$

4. For any $x \in C^n$, use (7.3.12) to get
$$\|(AB)x\|_v = \|A(Bx)\|_v \le \|A\| \, \|Bx\|_v \le \|A\| \, \|B\| \, \|x\|_v$$
$$\frac{\|ABx\|_v}{\|x\|_v} \le \|A\| \, \|B\|$$
This implies
$$\|AB\| \le \|A\| \, \|B\|$$
We now comment more extensively on the results given in Table 7.1.

Example 1. Use the vector norm

$$\|x\|_1 = \sum_{j=1}^{n} |x_j| \qquad x \in C^n$$
Then

$$\|Ax\|_1 = \sum_{i=1}^{n} \Big| \sum_{j=1}^{n} a_{ij} x_j \Big| \le \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}| \, |x_j|$$

Changing the order of summation, we can separate the summands,

$$\|Ax\|_1 \le \sum_{j=1}^{n} |x_j| \sum_{i=1}^{n} |a_{ij}|$$
Let
$$c = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}| \tag{7.3.15}$$
Then
$$\|Ax\|_1 \le c \, \|x\|_1$$
and thus
$$\|A\|_1 \le c$$
To show this is an equality, we demonstrate an x for which

$$\frac{\|Ax\|_1}{\|x\|_1} = c$$

Let k be the column index for which the maximum in (7.3.15) is attained. Let
$x = e^{(k)}$, the kth unit vector. Then $\|x\|_1 = 1$ and

$$\|Ax\|_1 = \sum_{i=1}^{n} \Big| \sum_{j=1}^{n} a_{ij} x_j \Big| = \sum_{i=1}^{n} |a_{ik}| = c$$

This proves that for the vector norm $\|\cdot\|_1$, the operator norm is

$$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}| \tag{7.3.16}$$

This is often called the column norm.

2. For $C^n$ with the norm $\|x\|_\infty$, the operator norm is

$$\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| \tag{7.3.17}$$

This is called the row norm of A. The proof of the formula is left as Problem 25,
although it is similar to that for $\|A\|_1$.


3. Use the norm $\|x\|_2$ on $C^n$. From (7.3.10), we conclude that
$$\|A\|_2 \le F(A) \tag{7.3.18}$$

In general, these are not equal. For example, with A = I, the identity matrix, use
(7.3.10) and (7.3.11) to obtain
$$F(I) = \sqrt{n} \qquad \|I\|_2 = 1$$

We prove
$$\|A\|_2 = \sqrt{r_\sigma(A^*A)} \tag{7.3.19}$$

as stated earlier in Table 7.1. The matrix $A^*A$ is Hermitian and all of its
eigenvalues are nonnegative, as shown in the proof of Theorem 7.5. Let it have
the eigenvalues

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$$

counted according to their multiplicity, and let $u^{(1)}, \ldots, u^{(n)}$ be the corresponding
eigenvectors, arranged as an orthonormal basis for $C^n$.
For a general $x \in C^n$,

$$\|Ax\|_2^2 = (Ax, Ax) = (x, A^*Ax)$$

Write x as
$$x = \sum_{j=1}^{n} \alpha_j u^{(j)} \qquad \alpha_j = (x, u^{(j)}) \tag{7.3.20}$$
Then
$$A^*Ax = \sum_{j=1}^{n} \alpha_j A^*A u^{(j)} = \sum_{j=1}^{n} \alpha_j \lambda_j u^{(j)}$$
and

$$\|Ax\|_2^2 = \Big( \sum_{i=1}^{n} \alpha_i u^{(i)}, \sum_{j=1}^{n} \alpha_j \lambda_j u^{(j)} \Big) = \sum_{j=1}^{n} \lambda_j |\alpha_j|^2 \le \lambda_1 \sum_{j=1}^{n} |\alpha_j|^2 = \lambda_1 \|x\|_2^2$$

using (7.3.20) to calculate $\|x\|_2$. Thus

$$\|A\|_2 \le \sqrt{\lambda_1}$$

Equality follows by noting that if $x = u^{(1)}$, then $\|x\|_2 = 1$ and
$$\|Ax\|_2^2 = (x, A^*Ax) = (u^{(1)}, \lambda_1 u^{(1)}) = \lambda_1$$

This proves (7.3.19), since $\lambda_1 = r_\sigma(A^*A)$. It can be shown that $AA^*$ and $A^*A$
have the same nonzero eigenvalues (see Problem 19); thus, $r_\sigma(AA^*) = r_\sigma(A^*A)$,
an alternative formula for (7.3.19). It also proves

$$\|A\|_2 = \|A^*\|_2 \tag{7.3.21}$$

This is not true for the previous matrix norms.
It can be shown fairly easily that if A is Hermitian, then

$$\|A\|_2 = r_\sigma(A) \tag{7.3.22}$$
This is left as Problem 27.
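Example Consider the matrix
$$A = \begin{bmatrix} 1 & -2 \\ -3 & 4 \end{bmatrix}$$

Then
$$\|A\|_1 = 6 \qquad \|A\|_2 = \sqrt{15 + \sqrt{221}} \approx 5.46 \qquad \|A\|_\infty = 7$$
As an illustration of the inequality (7.3.23) of the following theorem,

$$r_\sigma(A) = \frac{5 + \sqrt{33}}{2} \approx 5.37 < \|A\|_2$$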
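
The numbers in this example can be reproduced directly from the formulas of Table 7.1. The sketch below assumes the example matrix is A = [[1, -2], [-3, 4]], which is consistent with the three stated norms and the stated spectral radius; the helper `eig_2x2_real` is a hypothetical name and relies on the closed-form roots of a 2 x 2 characteristic polynomial, so it applies only when those roots are real.

```python
import math

A = [[1.0, -2.0], [-3.0, 4.0]]

# Column norm (7.3.16): largest absolute column sum
norm_1 = max(sum(abs(A[i][j]) for i in range(2)) for j in range(2))
# Row norm (7.3.17): largest absolute row sum
norm_inf = max(sum(abs(a) for a in row) for row in A)

def eig_2x2_real(M):
    """Eigenvalues of a 2 x 2 matrix with real eigenvalues, from trace and determinant."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = math.sqrt(tr * tr - 4.0 * det)
    return ((tr + disc) / 2.0, (tr - disc) / 2.0)

# A is real, so A*A is the transpose of A times A
AtA = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
norm_2 = math.sqrt(max(eig_2x2_real(AtA)))                   # (7.3.19)
spectral_radius = max(abs(lam) for lam in eig_2x2_real(A))   # (7.3.13)

print(norm_1, norm_2, norm_inf, spectral_radius)
```

In this instance the spectral radius lies below all three operator norms, as (7.3.23) requires, and is closest to the 2-norm.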


Theorem 7.8 Let A be an arbitrary square matrix. Then for any operator matrix
norm,

$$r_\sigma(A) \le \|A\| \tag{7.3.23}$$

Moreover, if $\epsilon > 0$ is given, then there is an operator matrix norm,
denoted here by $\|\cdot\|_\epsilon$, for which

$$\|A\|_\epsilon \le r_\sigma(A) + \epsilon \tag{7.3.24}$$

Proof To prove (7.3.23), let $\|\cdot\|$ be any matrix norm with an associated
compatible vector norm $\|\cdot\|_v$. Let $\lambda$ be the eigenvalue in $\sigma(A)$ for which

$$|\lambda| = r_\sigma(A)$$
and let x be an associated eigenvector, $\|x\|_v = 1$. Then
$$r_\sigma(A) = |\lambda| = \|\lambda x\|_v = \|Ax\|_v \le \|A\| \, \|x\|_v = \|A\|$$

which proves (7.3.23).
The proof of (7.3.24) is a nontrivial construction, and a proof is given
in Isaacson and Keller (1966, p. 12).


The following corollary is an easy, but important, consequence of Theorem 7.8.

Corollary For a square matrix A, $r_\sigma(A) < 1$ if and only if $\|A\| < 1$ for some
operator matrix norm.

This result can be used to prove Theorem 7.9 in the next section, but we prefer
to use the Jordan canonical form, given in Theorem 7.6. The results (7.3.22) and
Theorem 7.8 show that $r_\sigma(A)$ is almost a matrix norm, and this result is used in
analyzing the rates of convergence for some of the iteration methods given in
Chapter 8 for solving linear systems of equations.


7.4 Convergence and Perturbation Theorems 


The following results are the theoretical framework from which we later construct 
error analyses for numerical methods for linear systems of equations. 


Theorem 7.9 Let A be a square matrix of order n. Then $A^m$ converges to the
zero matrix as $m \to \infty$ if and only if $r_\sigma(A) < 1$.

Proof We use Theorem 7.6 as a fundamental tool. Let J be the Jordan
canonical form for A,

$$P^{-1}AP = J$$

Then
$$A^m = (PJP^{-1})^m = PJ^mP^{-1} \tag{7.4.1}$$

and $A^m \to 0$ if and only if $J^m \to 0$. Recall from (7.2.27) and (7.2.28) that
J can be written as

$$J = D + N$$
in which

$$D = \mathrm{diag}[\lambda_1, \ldots, \lambda_n]$$

contains the eigenvalues of J (and A), and N is a matrix for which

$$N^n = 0$$
By examining the structure of D and N, we have DN = ND. Then

$$J^m = (D + N)^m = \sum_{j=0}^{m} \binom{m}{j} D^{m-j} N^j$$
and using $N^j = 0$ for $j \ge n$,

$$J^m = \sum_{j=0}^{n} \binom{m}{j} D^{m-j} N^j \tag{7.4.2}$$

Notice that the powers of D satisfy
$$m - j \ge m - n \to \infty \quad \text{as } m \to \infty \tag{7.4.3}$$
We need the following limits: For any positive c < 1 and any $r \ge 0$,

$$\lim_{m \to \infty} m^r c^m = 0 \tag{7.4.4}$$
This can be proved using L'Hospital's rule from elementary calculus.

In (7.4.2), there are a fixed number of terms, n + 1, regardless of the
size of m, and we can consider the convergence of $J^m$ by considering
each of the individual terms. Assuming $r_\sigma(A) < 1$, we know that all
$|\lambda_i| < 1$, $i = 1, \ldots, n$. And for any matrix norm

$$\left\| \binom{m}{j} D^{m-j} N^j \right\| \le \binom{m}{j} \|D^{m-j}\| \, \|N\|^j$$

Using the row norm, we have that the preceding is bounded by
$$\frac{1}{j!} \, m^j \, [r_\sigma(A)]^{m-j}$$

which converges to zero as $m \to \infty$, using (7.4.3) and (7.4.4), for $0 \le j \le n$.
This proves half of the theorem, namely that if $r_\sigma(A) < 1$, then $J^m$
and $A^m$, from (7.4.1), converge to zero as $m \to \infty$.

Suppose that $r_\sigma(A) \ge 1$. Then let $\lambda$ be an eigenvalue of A for which
$|\lambda| \ge 1$, and let x be an associated eigenvector, $x \ne 0$. Then

$$A^m x = \lambda^m x$$

and clearly this does not converge to zero as $m \to \infty$. Thus it is not
possible that $A^m \to 0$, as that would imply $A^m x \to 0$. This completes the
proof.
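
Theorem 7.9 is a statement about the spectral radius, not about any particular norm, and a small experiment makes the distinction visible. The sketch below uses an illustrative matrix that is not from the text: a single 2 x 2 block with eigenvalue 0.9 and a large off-diagonal entry, so that its row norm is 10.9 while its spectral radius is 0.9. The helper names are hypothetical.

```python
def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def row_norm(A):
    """Row norm (7.3.17): largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def power_norms(A, m_max):
    """Return the row norms of A, A^2, ..., A^m_max, in that order."""
    norms, P = [], A
    for _ in range(m_max):
        norms.append(row_norm(P))
        P = matmul(P, A)
    return norms

A = [[0.9, 10.0], [0.0, 0.9]]    # r_sigma(A) = 0.9, row norm 10.9
norms = power_norms(A, 500)
print(norms[0], norms[9], norms[-1])
```

The norms of the powers first grow, through the factor $m^j$ in the bound of the proof, before the factor $[r_\sigma(A)]^{m-j}$ takes over and drives them to zero.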


Theorem 7.10 (Geometric Series) Let A be a square matrix. If $r_\sigma(A) < 1$, then
$(I - A)^{-1}$ exists, and it can be expressed as a convergent series,
$$(I - A)^{-1} = I + A + A^2 + \cdots + A^m + \cdots \tag{7.4.5}$$

Conversely, if the series in (7.4.5) is convergent, then $r_\sigma(A) < 1$.

Proof Assume $r_\sigma(A) < 1$. We show the existence of $(I - A)^{-1}$ by proving the
equivalent statement (3) of Theorem 7.2. Assume

$$(I - A)x = 0$$
Then Ax = x, and this implies that 1 is an eigenvalue of A if $x \ne 0$. But
we assumed $r_\sigma(A) < 1$, and thus we must have x = 0, concluding the
proof of the existence of $(I - A)^{-1}$.
We need the following identity:

$$(I - A)(I + A + A^2 + \cdots + A^m) = I - A^{m+1} \tag{7.4.6}$$
which is true for any matrix A. Multiplying by $(I - A)^{-1}$,
$$I + A + A^2 + \cdots + A^m = (I - A)^{-1}(I - A^{m+1})$$
The left-hand side has a limit if the right-hand side does. By Theorem
7.9, $r_\sigma(A) < 1$ implies that $A^{m+1} \to 0$ as $m \to \infty$. Thus we have the
result (7.4.5).
Conversely, assume the series converges and denote it by
$$B = I + A + A^2 + \cdots + A^m + \cdots$$
Then B - AB = B - BA = I, and thus I - A has an inverse, namely B.
Taking limits on both sides of (7.4.6), the left-hand side has the limit
(I - A)B = I, and thus the same must be true of the right-hand limit.
But that implies
$$A^{m+1} \to 0 \quad \text{as } m \to \infty$$
By Theorem 7.9, we must have $r_\sigma(A) < 1$.
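
The geometric series (7.4.5) can be watched converging on a small case. The following sketch uses an illustrative 2 x 2 matrix chosen here (eigenvalues 0.5 and 0.1, so its spectral radius is below 1) and compares the partial sum $I + A + \cdots + A^m$ with the inverse of I - A from the closed-form 2 x 2 formula. All helper names are hypothetical.

```python
def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def neumann_partial_sum(A, m):
    """Return B_m = I + A + A^2 + ... + A^m."""
    n = len(A)
    total, term = identity(n), identity(n)
    for _ in range(m):
        term = matmul(term, A)   # next power of A
        total = [[t + s for t, s in zip(rt, rs)] for rt, rs in zip(total, term)]
    return total

A = [[0.2, 0.3], [0.1, 0.4]]

# Exact inverse of I - A from the 2 x 2 adjugate formula
a, b, c, d = 1.0 - A[0][0], -A[0][1], -A[1][0], 1.0 - A[1][1]
det = a * d - b * c
exact = [[d / det, -b / det], [-c / det, a / det]]

approx = neumann_partial_sum(A, 60)
error = max(abs(approx[i][j] - exact[i][j]) for i in range(2) for j in range(2))
print(error)
```

By (7.4.6) the error after m terms is $(I - A)^{-1}A^{m+1}$, so for this matrix it shrinks roughly like $0.5^{m+1}$ until rounding error dominates.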
Theorem 7.11 Let A be a square matrix. If for some operator matrix norm,
$\|A\| < 1$, then $(I - A)^{-1}$ exists and has the geometric series
expansion (7.4.5). Moreover,

$$\|(I - A)^{-1}\| \le \frac{1}{1 - \|A\|} \tag{7.4.7}$$

Proof Since $\|A\| < 1$, it follows from (7.3.23) of Theorem 7.8 that $r_\sigma(A) < 1$.
Except for (7.4.7), the other conclusions follow from Theorem 7.10. For
(7.4.7), let
$$B_m = I + A + \cdots + A^m$$

From (7.4.6),
$$B_m = (I - A)^{-1}(I - A^{m+1})$$
$$(I - A)^{-1} - B_m = (I - A)^{-1}\big[ I - (I - A^{m+1}) \big] = (I - A)^{-1}A^{m+1} \tag{7.4.8}$$
Using the reverse triangle inequality,
$$\big|\, \|(I - A)^{-1}\| - \|B_m\| \,\big| \le \|(I - A)^{-1} - B_m\| \le \|(I - A)^{-1}\| \, \|A\|^{m+1}$$
Since this converges to zero as $m \to \infty$, we have
$$\|B_m\| \to \|(I - A)^{-1}\| \quad \text{as } m \to \infty \tag{7.4.9}$$
From the definition of $B_m$ and the properties of a matrix norm,
$$\|B_m\| \le \|I\| + \|A\| + \|A\|^2 + \cdots + \|A\|^m = \frac{1 - \|A\|^{m+1}}{1 - \|A\|} \le \frac{1}{1 - \|A\|}$$

Combined with (7.4.9), this concludes the proof of (7.4.7).


Theorem 7.12 Let A and B be square matrices of the same order. Assume A is
nonsingular and suppose that

$$\|A - B\| < \frac{1}{\|A^{-1}\|} \tag{7.4.10}$$
Then B is also nonsingular,
$$\|B^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \, \|A - B\|} \tag{7.4.11}$$
and
$$\|A^{-1} - B^{-1}\| \le \frac{\|A^{-1}\|^2 \, \|A - B\|}{1 - \|A^{-1}\| \, \|A - B\|} \tag{7.4.12}$$

Proof Note the identity
$$B = A - (A - B) = A\big[ I - A^{-1}(A - B) \big] \tag{7.4.13}$$

The matrix $[I - A^{-1}(A - B)]$ is nonsingular using Theorem 7.11, based
on the inequality (7.4.10), which implies

$$\|A^{-1}(A - B)\| \le \|A^{-1}\| \, \|A - B\| < 1$$
Since B is the product of nonsingular matrices, it too is nonsingular,

$$B^{-1} = \big[ I - A^{-1}(A - B) \big]^{-1} A^{-1}$$
The bound (7.4.11) follows by taking norms and applying Theorem 7.11.
To prove (7.4.12), use

$$A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$$
Take norms again and apply (7.4.11).

This theorem is important in a number of ways. But for the moment, it says
that all sufficiently close perturbations of a nonsingular matrix are nonsingular.
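
The two bounds of Theorem 7.12 are easy to test on a small perturbation. The sketch below is only an illustration: the matrices A and B are arbitrary choices made here, the row norm is used as the operator norm, and the helper `inv2` (a hypothetical name) inverts a 2 x 2 matrix by the adjugate formula.

```python
def row_norm(M):
    """Row norm (7.3.17): largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

def inv2(M):
    """Inverse of a nonsingular 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def sub(X, Y):
    """Entrywise difference X - Y."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

A = [[4.0, 1.0], [2.0, 3.0]]
B = [[3.9, 1.2], [1.95, 2.9]]    # a small perturbation of A

A_inv, B_inv = inv2(A), inv2(B)
a_inv = row_norm(A_inv)           # ||A^{-1}||
delta = row_norm(sub(A, B))       # ||A - B||

hypothesis = delta < 1.0 / a_inv                                # (7.4.10)
bound_7_4_11 = a_inv / (1.0 - a_inv * delta)                    # bound on ||B^{-1}||
bound_7_4_12 = a_inv ** 2 * delta / (1.0 - a_inv * delta)       # bound on ||A^{-1} - B^{-1}||

lhs_11 = row_norm(B_inv)
lhs_12 = row_norm(sub(A_inv, B_inv))
print(hypothesis, lhs_11, bound_7_4_11, lhs_12, bound_7_4_12)
```

For this pair the bounds hold with room to spare; they become sharper indicators as $\|A^{-1}\| \, \|A - B\|$ approaches 1, where the denominators vanish.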


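Example We illustrate Theorem 7.11 by considering the invertibility of the
matrix

$$A = \begin{bmatrix} 4 & 1 & 0 & \cdots & 0 \\ 1 & 4 & 1 & & \vdots \\ 0 & 1 & 4 & 1 & \\ \vdots & & \ddots & \ddots & \ddots \\ & & 1 & 4 & 1 \\ 0 & \cdots & 0 & 1 & 4 \end{bmatrix}$$
Rewrite A as
$$A = 4(I + B) \qquad B = \begin{bmatrix} 0 & \frac{1}{4} & 0 & \cdots & 0 \\ \frac{1}{4} & 0 & \frac{1}{4} & & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & & \frac{1}{4} & 0 & \frac{1}{4} \\ 0 & \cdots & 0 & \frac{1}{4} & 0 \end{bmatrix}$$

Using the row norm (7.3.17), $\|B\|_\infty = \frac{1}{2}$. Thus $(I + B)^{-1}$ exists from Theorem
7.11, and from (7.4.7),

$$\|(I + B)^{-1}\|_\infty \le \frac{1}{1 - \frac{1}{2}} = 2$$

Thus $A^{-1}$ exists, $A^{-1} = \frac{1}{4}(I + B)^{-1}$, and
$$\|A^{-1}\|_\infty \le \frac{1}{2}$$
By use of the row norm and inequality (7.3.23),

$$r_\sigma(A) \le 6 \qquad r_\sigma(A^{-1}) \le \frac{1}{2}$$
Since the eigenvalues of $A^{-1}$ are the reciprocals of those of A (see Problem 27),
and since all eigenvalues of A are real because A is Hermitian, we have the
bound

$$2 \le |\lambda| \le 6 \qquad \text{all } \lambda \in \sigma(A)$$

For better bounds in this case, see the Gerschgorin Circle Theorem of Chapter 9.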
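
The bound $\|A^{-1}\|_\infty \le \frac{1}{2}$ can be confirmed exactly for a specific order. The sketch below assumes the tridiagonal matrix of the example (4 on the diagonal, 1 on the sub- and superdiagonals) and takes n = 6 as an arbitrary illustrative size; it inverts A by Gauss-Jordan elimination in exact rational arithmetic using Python's `fractions` module. The function names are hypothetical.

```python
from fractions import Fraction

def tridiag_4_1(n):
    """Tridiagonal matrix with 4 on the diagonal and 1 on the adjacent diagonals."""
    return [[Fraction(4) if i == j else Fraction(1) if abs(i - j) == 1 else Fraction(0)
             for j in range(n)] for i in range(n)]

def inverse(M):
    """Gauss-Jordan inverse in exact rational arithmetic (M must be nonsingular)."""
    n = len(M)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        # find a row with a nonzero pivot and move it into place
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def row_norm(M):
    """Row norm (7.3.17): largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

n = 6
A = tridiag_4_1(n)
B = [[A[i][j] / 4 - (1 if i == j else 0) for j in range(n)] for i in range(n)]  # A = 4(I + B)
A_inv = inverse(A)

norm_B = row_norm(B)
norm_A_inv = row_norm(A_inv)
print(norm_B, norm_A_inv, float(norm_A_inv))
```

Because the arithmetic is exact, the comparison of `norm_A_inv` with 1/2 is not affected by rounding.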


Discussion of the Literature


The subject of this chapter is linear algebra, especially selected for use in deriving 
and analyzing methods of numerical linear algebra. The books by Anton (1984) 
and Strang (1980) are introductory-level texts for undergraduate linear algebra. 
Franklin’s (1968) is a higher level introduction to matrix theory, and Halmos’s 
(1958) is a well-known text on abstract linear algebra. Noble’s (1969) is a 
wide-ranging applied linear algebra text. Introductions to the foundations are 
also contained in Fadeeva (1959), Golub and Van Loan (1982), Parlett (1980), 
Stewart (1973), and Wilkinson (1965), all of which are devoted entirely to 
numerical linear algebra. For additional theory at a more detailed and higher 
level, see the classical accounts of Gantmacher (1960) and Householder (1965). 
Additional references are given in the bibliographies of Chapters 8 and 9. 
Problems


1. Determine whether the following sets of vectors are dependent or independent.

(a) (1, 2, -1, 3), (3, -1, 1, 1), (1, 9, -5, 11)
(b) (1, 1, 0), (0, 1, 1), (0, 0, 1)


2. Let A, B, and C be matrices of order $m \times n$, $n \times p$, and $p \times q$, respectively.

(a) Prove the associative law (AB)C = A(BC).
(b) Prove $(AB)^T = B^T A^T$.


3. (a) Produce square matrices A and B for which $AB \ne BA$.

(b) Produce square matrices A and B, with no zero entries, for which
AB = 0, $BA \ne 0$.


4. Let A be a matrix of order $m \times n$, and let r and c denote the row and
column rank of A, respectively. Prove that r = c. Hint: For convenience,
assume that the first r rows of A are independent, with the remaining rows
dependent on these first r rows, and assume the same for the first c
columns of A. Let $\hat{A}$ denote the $r \times n$ matrix obtained by deleting the last
m - r rows of A, and let $\hat{r}$ and $\hat{c}$ denote the row and column rank of $\hat{A}$,
respectively. Clearly $\hat{r} = r$. Also, the columns of $\hat{A}$ are elements of $C^r$,
which has dimension r, and thus we must have $\hat{c} \le r$. Show that $\hat{c} = c$,
thus proving that $c \le r$. The reverse inequality will follow by applying the
same argument to $A^T$, and taken together, these two inequalities
imply r = c.


5. Prove the equivalence of statements (1)-(4) and (6) of Theorem 7.2. Hint:
Use Theorem 7.1, the result in Problem 4, and the decomposition (7.1.6).


6. Let

$$f_n(x) = \det \begin{bmatrix} x & 1 & 0 & \cdots & 0 \\ 1 & x & 1 & & \vdots \\ 0 & 1 & x & \ddots & 0 \\ \vdots & & \ddots & \ddots & 1 \\ 0 & \cdots & 0 & 1 & x \end{bmatrix}$$

with the matrix of order n. Also define $f_0(x) = 1$.

(a) Show
$$f_{n+1}(x) = x f_n(x) - f_{n-1}(x) \qquad n \ge 1$$

(b) Show
$$f_n(x) = S_n\left( \frac{x}{2} \right) \qquad n \ge 0$$
with $S_n(x)$ the Chebyshev polynomial of the second kind of degree n
(see Problem 24 in Chapter 4).


7. Let A be a square matrix of order n with real entries. The function

$$q(x_1, \ldots, x_n) = (Ax, x) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j \qquad x \in R^n$$

is called the quadratic form determined by A. It is a quadratic polynomial in
the n variables $x_1, \ldots, x_n$, and it occurs when considering the maximization
or minimization of a function of n variables.

(a) Prove that if A is skew-symmetric, then $q(x) \equiv 0$.

(b) For a general square matrix A, define $A_1 = \frac{1}{2}(A + A^T)$, $A_2 =
\frac{1}{2}(A - A^T)$. Then $A = A_1 + A_2$. Show that $A_1$ is symmetric, and
$(Ax, x) = (A_1 x, x)$, all $x \in R^n$. This shows that the coefficient matrix
A for a quadratic form can always be assumed to be symmetric,
without any loss of generality.


8. Given the orthogonal vectors

$$u^{(1)} = (1, 2, -1) \qquad u^{(2)} = (1, 1, 3)$$

produce a third vector $u^{(3)}$ such that $\{u^{(1)}, u^{(2)}, u^{(3)}\}$ is an orthogonal basis
for $R^3$. Normalize these vectors to obtain an orthonormal basis.
9. For the column vector $w \in C^n$ with $\|w\|_2 = \sqrt{w^*w} = 1$, define the $n \times n$
matrix

$$A = I - 2ww^*$$

(a) For the special case $w = [\frac{1}{3}, \frac{2}{3}, \frac{2}{3}]^T$, produce the matrix A. Verify that
it is symmetric and orthogonal.
(b) Show that, in general, all such matrices A are Hermitian and unitary.


10. Let W be a subspace of $R^n$. For $x \in R^n$, define

$$\rho(x) = \inf_{y \in W} \|x - y\|_2$$

Let $\{u_1, \ldots, u_m\}$ be an orthonormal basis of W, where m is the dimension
of W. Extend this to an orthonormal basis $\{u_1, \ldots, u_m, \ldots, u_n\}$ of all
of $R^n$.

(a) Show that

$$\rho(x) = \left[ \sum_{j=m+1}^{n} |(x, u_j)|^2 \right]^{1/2}$$

and that it is uniquely attained at

$$y = Px \qquad P = \sum_{j=1}^{m} u_j u_j^T$$

This is called the orthogonal projection of x onto W.
(b) Show $P^2 = P$. Such a matrix is called a projection matrix.
(c) Show $P^T = P$.
(d) Show (Px, z - Pz) = 0 for all $x, z \in R^n$.
(e) Show $\|x\|_2^2 = \|Px\|_2^2 + \|x - Px\|_2^2$, for all $x \in R^n$. This is a version of
the theorem of Pythagoras.


11. Calculate the eigenvalues and eigenvectors of the following matrices.


@ [I 4] » [2 


12. Let $y \ne 0$ in $R^n$, and define $A = yy^T$, an $n \times n$ matrix. Show that $\lambda = 0$ is
an eigenvalue of multiplicity exactly n - 1. What is the single nonzero
eigenvalue?


13. Let U be an $n \times n$ unitary matrix.

(a) Show $\|Ux\|_2 = \|x\|_2$, all $x \in C^n$. Use this to prove that the distance
between points x and y is the same as the distance between Ux and
Uy, showing that unitary transformations of $C^n$ preserve distances
between all points.

(b) Let U be orthogonal, and show that
$$(Ux, Uy) = (x, y) \qquad x, y \in R^n$$
This shows that orthogonal transformations of $R^n$ also preserve angles
between lines, as defined in (7.1.12).

(c) Show that all eigenvalues of a unitary matrix have magnitude one.


14. Let A be a Hermitian matrix of order n. It is called positive definite if and
only if (Ax, x) > 0 for all $x \ne 0$ in $C^n$. Show that A is positive definite if
and only if all of its eigenvalues are real and positive. Hint: Use Theorem
7.4, and expand (Ax, x) by using an eigenvector basis to express an
arbitrary $x \in C^n$.


15. Let A be real and symmetric, and denote its eigenvalues by $\lambda_1, \ldots, \lambda_n$,
repeated according to their multiplicity. Using a basis of orthonormal
eigenvectors, show that the quadratic form of Problem 7, q(x) = (Ax, x),
$x \in R^n$, can be reduced to the simpler form

$$q(x) = \sum_{j=1}^{n} \alpha_j^2 \lambda_j$$

with the $\{\alpha_j\}$ determined from x. Using this, explore the possible graphs
for

$$(Ax, x) = \text{constant}$$
when A is of order 3.


16. Assume A is real, symmetric, positive definite, and of order n. Define
$$f(x) = \frac{1}{2} x^T A x - b^T x \qquad x, b \in R^n$$

Show that the unique minimum of f(x) is given by solving Ax = b for
$x = A^{-1}b$.


17. Let f(x) be a real valued function of $x \in R^n$, and assume f(x) is three
times continuously differentiable with respect to the components of x.
Apply Taylor's theorem 1.5 of Chapter 1, generalized to n variables, to
obtain

$$f(x) = f(a) + (x - a)^T \nabla f(a) + \frac{1}{2}(x - a)^T H(a)(x - a) + O(\|x - a\|^3)$$
Here

$$\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right]^T$$

is the gradient of f, and

$$H(x) = \left[ \frac{\partial^2 f(x)}{\partial x_i \, \partial x_j} \right] \qquad 1 \le i, j \le n$$

is the Hessian matrix for f(x). The final term indicates that the remaining
terms are smaller than some multiple of $\|x - a\|^3$ for x close to a.

If a is to be a local maximum or minimum, then a necessary condition is
that $\nabla f(a) = 0$. Assuming $\nabla f(a) = 0$, show that a is a strict (or unique)
local minimum of f(x) if and only if H(a) is positive definite. [Note that
H(x) is always symmetric.]


18. Demonstrate the relation (7.2.28).


19. Recall the notation used in Theorem 7.5 on the singular value decomposition
of a matrix A.

(a) Show that $\mu_1^2, \ldots, \mu_r^2$ are the nonzero eigenvalues of $A^*A$ and $AA^*$,
with corresponding eigenvectors $U^{(1)}, \ldots, U^{(r)}$ and $V^{(1)}, \ldots, V^{(r)}$,
respectively. The vector $U^{(j)}$ denotes column j of U, and similarly for
$V^{(j)}$ and V.

(b) Show that $AU^{(j)} = \mu_j V^{(j)}$, $A^*V^{(j)} = \mu_j U^{(j)}$, $1 \le j \le r$.

(c) Prove r = rank(A).


20. For any polynomial $p(x) = b_0 + b_1 x + \cdots + b_m x^m$, and for A any square
matrix, define

$$p(A) = b_0 I + b_1 A + \cdots + b_m A^m$$

Let A be a matrix for which the Jordan canonical form is a diagonal
matrix,
$$P^{-1}AP = D = \mathrm{diag}[\lambda_1, \ldots, \lambda_n]$$

For the characteristic polynomial $f_A(\lambda)$ of A, prove $f_A(A) = 0$. (This result
is the Cayley-Hamilton theorem. It is true for any square matrix, not just
those that have a diagonal Jordan canonical form.) Hint: Use the result
$A = PDP^{-1}$ to simplify $f_A(A)$.


21. Prove the following: for $x \in C^n$

(a) $\|x\|_\infty \le \|x\|_1 \le n \|x\|_\infty$
(b) $\|x\|_\infty \le \|x\|_2 \le \sqrt{n} \, \|x\|_\infty$
(c) $\|x\|_2 \le \|x\|_1 \le \sqrt{n} \, \|x\|_2$


22. Let A be a real nonsingular matrix of order n, and let $\|\cdot\|_v$ denote a
vector norm on $R^n$. Define

$$\|x\|_* = \|Ax\|_v \qquad x \in R^n$$

Show that $\|\cdot\|_*$ is a vector norm on $R^n$.


23. Show

$$\lim_{p \to \infty} \left[ \sum_{j=1}^{n} |x_j|^p \right]^{1/p} = \max_{1 \le i \le n} |x_i| \qquad x \in C^n$$

This justifies the use of the notation $\|x\|_\infty$ for the right side.


24. For any matrix norm, show that (a) $\|I\| \ge 1$, and (b) $\|A^{-1}\| \ge 1/\|A\|$. For
an operator norm, it is immediate from (7.3.11) that $\|I\| = 1$.


25. Derive formula (7.3.17) for the operator matrix norm $\|A\|_\infty$.


26. Define a vector norm on $R^n$ by

$$\|x\| = \frac{1}{n} \sum_{j=1}^{n} |x_j| \qquad x \in R^n$$

What is the operator matrix norm associated with this vector norm?


27. Let A be a square matrix of order $n \times n$.

(a) Given the eigenvalues and eigenvectors of A, determine those
of (1) $A^m$ for $m \ge 2$, (2) $A^{-1}$, assuming A is nonsingular, and
(3) A + cI, c = constant.
(b) Prove $\|A\|_2 = r_\sigma(A)$ when A is Hermitian.

(c) For A arbitrary and U unitary of the same order, show $\|AU\|_2 =
\|UA\|_2 = \|A\|_2$.


28. Let A be square of order $n \times n$.

(a) Show that F(AU) = F(UA) = F(A), for any unitary matrix U.

(b) If A is Hermitian, then show that

$$F(A) = \sqrt{\lambda_1^2 + \cdots + \lambda_n^2}$$

where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of A, repeated according to their
multiplicity. Furthermore,

$$\frac{1}{\sqrt{n}} F(A) \le \|A\|_2 \le F(A)$$


29. Recalling the notation of Theorem 7.5, show

$$\|A\|_2 = \mu_1 \qquad F(A) = \sqrt{\mu_1^2 + \cdots + \mu_r^2}$$


30. Let A be of order $n \times n$. Show
$$|\mathrm{trace}(A)| \le n \, r_\sigma(A)$$
If A is symmetric and positive definite, show
$$\mathrm{trace}(A) \ge r_\sigma(A)$$


31. Show that the infinite series

$$I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \cdots + \frac{A^n}{n!} + \cdots$$

converges for any square matrix A, and denote the sum of the series by $e^A$.

(a) If $A = P^{-1}BP$, show that $e^A = P^{-1}e^BP$.

(b) Let $\lambda_1, \ldots, \lambda_n$ denote the eigenvalues of A, repeated according to
their multiplicity, and show that the eigenvalues of $e^A$ are $e^{\lambda_1}, \ldots, e^{\lambda_n}$.
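32. Consider the matrix

$$A = \begin{bmatrix} 6 & 1 & 1 & 0 & \cdots & & 0 \\ 1 & 6 & 1 & 1 & 0 & & \vdots \\ 1 & 1 & 6 & 1 & 1 & \ddots & \\ 0 & 1 & 1 & 6 & 1 & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & 1 \\ & & & 1 & 1 & 6 & 1 \\ 0 & \cdots & & 0 & 1 & 1 & 6 \end{bmatrix}$$

Show A is nonsingular. Find a bound for $\|A^{-1}\|_\infty$ and $\|A^{-1}\|_2$.


33. In producing cubic interpolating splines in Section 3.7 of Chapter 3, it was
necessary to solve the linear system AM = D of (3.7.21) with

$$A = \begin{bmatrix} \frac{h_1}{3} & \frac{h_1}{6} & 0 & \cdots & 0 \\ \frac{h_1}{6} & \frac{h_1 + h_2}{3} & \frac{h_2}{6} & & \vdots \\ & \ddots & \ddots & \ddots & \\ \vdots & & \frac{h_{m-1}}{6} & \frac{h_{m-1} + h_m}{3} & \frac{h_m}{6} \\ 0 & \cdots & 0 & \frac{h_m}{6} & \frac{h_m}{3} \end{bmatrix}$$

All $h_i > 0$, $i = 1, \ldots, m$. Using one or more of the results of Section 7.4,
show that A is nonsingular. In addition, derive the bounds

$$\frac{1}{6} \min_i (h_i) \le |\lambda| \le \max_i h_i \qquad \lambda \in \sigma(A)$$

for the eigenvalues of A.


34. Let A be a square matrix, with $A^m = 0$ for some $m \ge 2$. Show that I - A
is nonsingular. Such a matrix A is called nilpotent.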
NUMERICAL SOLUTION
OF SYSTEMS OF LINEAR EQUATIONS


Systems of linear equations arise in a large number of areas, both directly in 
modeling physical situations and indirectly in the numerical solution of other 
mathematical models. These applications occur in virtually all areas of the 
physical, biological, and social sciences. In addition, linear systems are involved 
in the following: optimization theory; solving systems of nonlinear equations; the 
approximation of functions; the numerical solution of boundary value problems 
for ordinary differential equations, partial differential equations, and integral 
equations; statistical inference; and numerous other problems. Because of the 
widespread importance of linear systems, much research has been devoted to 
their numerical solution. Excellent algorithms have been developed for the most 
common types of problems for linear systems, and some of these are defined, 
analyzed, and illustrated in this chapter. 
The most common type of problem is to solve a square linear system 


Ax=b5 


of moderate order, with coefficients that are mostly nonzero. Such linear systems, 
of any order, are called dense. For such systems, the coefficient matrix A must 
generally be stored in the main memory of the computer in order to efficiently 
solve the linear system, and thus memory storage limitations in most computers 
will limit the order of the system. With the rapid decrease in the cost of computer 
memory, quite large linear systems can be accommodated on some machines, but 
it is expected that for most smaller machines, the practical upper limits on the
order will be of size 100 to 500. Most algorithms for solving such dense systems
are based on Gaussian elimination, which is defined in Section 8.1. It is a direct
method in the theoretical sense that if rounding errors are ignored, then the exact
answer is found in a finite number of steps. Modifications for improved error
behavior with Gaussian elimination, variants for special classes of matrices, and
error analyses are given in Section 8.2 through Section 8.5.

A second important type of problem is to solve Ax = b when A is square,
sparse, and of large order. A sparse matrix is one in which most coefficients are
zero. Such systems arise in a variety of ways, but we restrict our development to
those for which there is a simple, known pattern for the nonzero coefficients.
These systems arise commonly in the numerical solution of partial differential
equations, and an example is given in Section 8.8. Because of the large order of
most sparse systems of linear equations, sometimes as large as $10^5$ or more, the
linear system cannot usually be solved by a direct method such as Gaussian
elimination. Iteration methods are the preferred method of solution, and these
are introduced in Section 8.6 through Section 8.9.
For solving dense square systems of moderate order, most computer centers
have a set of programs that can be used for a variety of problems. Students
should become acquainted with those at their university computer center and use
them to further illustrate the material of this chapter. An excellent package is
called LINPACK, and it is described in Dongarra et al. (1979). It is widely
available, and we will make further reference to it later in this chapter.


8.1. Gaussian Elimination 


This is the formal name given to the method of solving systems of linear 
equations by successively eliminating unknowns and reducing to systems of lower 
order. It is the method most people learn in high school algebra or in an 
undergraduate linear algebra course (in which it is often associated with produc- 
ing the row-echelon form of a matrix). A precise definition is given of Gaussian 
elimination, which is necessary when implementing it on a computer and when 
analyzing the effects of rounding errors that occur when computing with it. 

To solve Ax = b, we reduce it to an equivalent system Ux = g, in which U is 
upper triangular. This system can be easily solved by a process of back-substitution.
Denote the original linear system by $A^{(1)}x = b^{(1)}$,

$$A^{(1)} = [a_{ij}^{(1)}] \qquad b^{(1)} = [b_1^{(1)}, \ldots, b_n^{(1)}]^T \qquad 1 \le i, j \le n$$

in which n is the order of the system. We reduce the system to the triangular 
form Ux = g by adding multiples of one equation to another equation, eliminat- 
ing some unknown from the second equation. Additional row operations are used 
in the modifications given in succeeding sections. To keep the presentation 
simple, we make some technical assumptions in defining the algorithm; they are 
removed in the next section. 


Gaussian elimination algorithm 


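STEP 1: Assume $a_{11}^{(1)} \ne 0$. Define the row multipliers by

$$m_{i1} = \frac{a_{i1}^{(1)}}{a_{11}^{(1)}} \qquad i = 2, \ldots, n$$

These are used in eliminating the $x_1$ term from equations 2 through n.
Define

$$a_{ij}^{(2)} = a_{ij}^{(1)} - m_{i1} a_{1j}^{(1)} \qquad i, j = 2, \ldots, n$$
$$b_i^{(2)} = b_i^{(1)} - m_{i1} b_1^{(1)} \qquad i = 2, \ldots, n$$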


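Also, the first rows of A and b are left undisturbed, and the first
column of A, below the diagonal, is set to zero. The system
$A^{(2)}x = b^{(2)}$ looks like

$$\begin{bmatrix} a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\ 0 & a_{22}^{(2)} & \cdots & a_{2n}^{(2)} \\ \vdots & \vdots & & \vdots \\ 0 & a_{n2}^{(2)} & \cdots & a_{nn}^{(2)} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1^{(1)} \\ b_2^{(2)} \\ \vdots \\ b_n^{(2)} \end{bmatrix}$$

We continue to eliminate unknowns, going on to columns 2, 3, etc.,
and this is expressed generally in the following.

STEP k: Let $1 \le k \le n - 1$. Assume that $A^{(k)}x = b^{(k)}$ has been constructed,
with $x_1, x_2, \ldots, x_{k-1}$ eliminated at successive stages, and $A^{(k)}$ has
the form

$$A^{(k)} = \begin{bmatrix} a_{11}^{(1)} & a_{12}^{(1)} & \cdots & \cdots & \cdots & a_{1n}^{(1)} \\ 0 & a_{22}^{(2)} & \cdots & \cdots & \cdots & a_{2n}^{(2)} \\ \vdots & \ddots & \ddots & & & \vdots \\ 0 & \cdots & 0 & a_{kk}^{(k)} & \cdots & a_{kn}^{(k)} \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & a_{nk}^{(k)} & \cdots & a_{nn}^{(k)} \end{bmatrix}$$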

Assume $a_{kk}^{(k)} \ne 0$. Define the multipliers

$$m_{ik} = \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}} \qquad i = k+1, \ldots, n \tag{8.1.1}$$

Use these to remove the unknown $x_k$ from equations k + 1 through
n. Define

$$a_{ij}^{(k+1)} = a_{ij}^{(k)} - m_{ik} a_{kj}^{(k)}$$
$$b_i^{(k+1)} = b_i^{(k)} - m_{ik} b_k^{(k)} \qquad i, j = k+1, \ldots, n \tag{8.1.2}$$

The earlier rows 1 through k are left undisturbed, and zeros are
introduced into column k below the diagonal element.

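By continuing in this manner, after n − 1 steps we obtain
$A^{(n)}x = b^{(n)}$:

$$\begin{bmatrix} a_{11}^{(1)} & \cdots & a_{1n}^{(1)} \\ & \ddots & \vdots \\ 0 & & a_{nn}^{(n)} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1^{(1)} \\ \vdots \\ b_n^{(n)} \end{bmatrix}$$

For notational convenience, let $U = A^{(n)}$ and $g = b^{(n)}$. The system
Ux = g is upper triangular, and it is quite easy to solve. First

$$x_n = \frac{g_n}{u_{nn}}$$

and then

$$x_k = \frac{1}{u_{kk}} \left[ g_k - \sum_{j=k+1}^{n} u_{kj} x_j \right] \qquad k = n-1, n-2, \ldots, 1 \tag{8.1.3}$$

This completes the Gaussian elimination algorithm.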
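The steps of the algorithm can be sketched in a few lines of Python. This is only an illustration under the same assumption as the text, namely that no pivot is zero when it is reached; the function name `gaussian_elimination` and the use of exact `Fraction` arithmetic are choices made for this sketch. The data are those of the system (8.1.4) in the example that follows.

```python
from fractions import Fraction


def gaussian_elimination(a, b):
    """Solve Ax = b by elimination without pivoting, then back substitution.

    Assumes every pivot a[k][k] is nonzero when it is reached, as in the text.
    """
    n = len(a)
    a = [row[:] for row in a]  # work on copies; the caller's data is untouched
    b = b[:]
    for k in range(n - 1):  # step k of the algorithm (0-based here)
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]  # multiplier m_ik of (8.1.1)
            a[i][k] = 0
            for j in range(k + 1, n):  # update rule (8.1.2)
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    x = [0] * n
    for k in range(n - 1, -1, -1):  # back substitution (8.1.3)
        s = sum(a[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (b[k] - s) / a[k][k]
    return x


# The system (8.1.4), held as exact fractions so that no rounding occurs.
A = [[Fraction(v) for v in row] for row in [[1, 2, 1], [2, 2, 3], [-1, -3, 0]]]
b = [Fraction(0), Fraction(3), Fraction(2)]
x = gaussian_elimination(A, b)
```

With floating-point entries the same loops apply unchanged, but rounding errors then enter at every arithmetic operation.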
Example Solve the linear system

$$\begin{aligned} x_1 + 2x_2 + x_3 &= 0 \\ 2x_1 + 2x_2 + 3x_3 &= 3 \\ -x_1 - 3x_2 &= 2 \end{aligned} \tag{8.1.4}$$

To simplify the notation, we note that the unknowns $x_1, x_2, x_3$ never enter into
the algorithm until the final step. Thus we represent the preceding linear system
with the augmented matrix

$$[A \mid b] = \left[\begin{array}{rrr|r} 1 & 2 & 1 & 0 \\ 2 & 2 & 3 & 3 \\ -1 & -3 & 0 & 2 \end{array}\right]$$


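The row operations are performed on this augmented matrix, and the unknowns
are given in the final step. In the following diagram, the multipliers are given next
to an arrow, corresponding to the changes they cause:

$$\left[\begin{array}{rrr|r} 1 & 2 & 1 & 0 \\ 2 & 2 & 3 & 3 \\ -1 & -3 & 0 & 2 \end{array}\right] \xrightarrow[m_{31} = -1]{m_{21} = 2} \left[\begin{array}{rrr|r} 1 & 2 & 1 & 0 \\ 0 & -2 & 1 & 3 \\ 0 & -1 & 1 & 2 \end{array}\right]$$

$$\xrightarrow{m_{32} = 1/2} \quad [U \mid g] = \left[\begin{array}{rrr|r} 1 & 2 & 1 & 0 \\ 0 & -2 & 1 & 3 \\ 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{array}\right]$$

Solving Ux = g,

$$x_3 = 1 \qquad x_2 = -1 \qquad x_1 = 1$$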
Triangular factorization of a matrix It is convenient to keep the multipliers $m_{ij}$,
since we often want to solve Ax = b with the same A but a different vector b. In
the computer the elements $a_{ij}^{(k+1)}$, $j \ge i$, always are stored into the storage for
$a_{ij}^{(k)}$. The elements below the diagonal are being zeroed, and this provides a
convenient storage for the elements $m_{ij}$. Store $m_{ij}$ into the space originally used
to store $a_{ij}$, $i > j$.

There is yet another reason for looking at the multipliers $m_{ij}$ as the elements
of a matrix. First, introduce the lower triangular matrix

$$L = \begin{bmatrix} 1 & 0 & \cdots & & 0 \\ m_{21} & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ m_{n1} & m_{n2} & \cdots & m_{n,n-1} & 1 \end{bmatrix}$$

Theorem 8.1 If L and U are the lower and upper triangular matrices defined 
previously using Gaussian elimination, then 


A=LU (8.1.5) 


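Proof This proof is basically an algebraic manipulation, making use of definitions
(8.1.1) and (8.1.2). To visualize the matrix element $(LU)_{ij}$, use the
vector formula

$$(LU)_{ij} = [m_{i1}, \ldots, m_{i,i-1}, 1, 0, \ldots, 0] \begin{bmatrix} u_{1j} \\ \vdots \\ u_{jj} \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

For $i \le j$,

$$\begin{aligned} (LU)_{ij} &= m_{i1} u_{1j} + m_{i2} u_{2j} + \cdots + m_{i,i-1} u_{i-1,j} + u_{ij} \\ &= \sum_{k=1}^{i-1} m_{ik} a_{kj}^{(k)} + a_{ij}^{(i)} \\ &= \sum_{k=1}^{i-1} \left[ a_{ij}^{(k)} - a_{ij}^{(k+1)} \right] + a_{ij}^{(i)} \\ &= a_{ij}^{(1)} = a_{ij} \end{aligned}$$

For $i > j$,

$$\begin{aligned} (LU)_{ij} &= m_{i1} u_{1j} + \cdots + m_{ij} u_{jj} \\ &= \sum_{k=1}^{j-1} m_{ik} a_{kj}^{(k)} + m_{ij} a_{jj}^{(j)} \\ &= \sum_{k=1}^{j-1} \left[ a_{ij}^{(k)} - a_{ij}^{(k+1)} \right] + a_{ij}^{(j)} \\ &= a_{ij}^{(1)} = a_{ij} \end{aligned}$$

This completes the proof.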


The decomposition (8.1.5) is an important result, and extensive use is made of 
it in developing variants of Gaussian elimination for special classes of matrices. 
But for the moment we give only the following corollary. 


Corollary With the matrices A, L, and U as in Theorem 8.1, 


$$\det(A) = u_{11} u_{22} \cdots u_{nn}$$


Proof By the product rule for determinants, 
det (A) = det(L) det(U) 
Since L and U are triangular, their determinants are the product of their
diagonal elements. The desired result follows easily, since det(L) = 1.


Example For the system (8.1.4) of the previous example, 


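$$L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & \tfrac{1}{2} & 1 \end{bmatrix} \qquad U = \begin{bmatrix} 1 & 2 & 1 \\ 0 & -2 & 1 \\ 0 & 0 & \tfrac{1}{2} \end{bmatrix}$$

It is easily verified that A = LU. Also det(A) = det(U) = −1.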
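The verification can also be carried out mechanically. The following sketch keeps the multipliers of (8.1.1) as the entries of L, forms the product LU, and evaluates det(A) as the product of the pivots. The helper names `lu_factor` and `matmul` are hypothetical, and exact `Fraction` arithmetic is used so that the comparison with A is exact.

```python
from fractions import Fraction


def lu_factor(a):
    """Return (L, U) from elimination without pivoting; assumes nonzero pivots."""
    n = len(a)
    u = [[Fraction(v) for v in row] for row in a]
    l = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]  # multiplier m_ik, kept as an entry of L
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return l, u


def matmul(p, q):
    """Plain triple-loop matrix product for small lists of lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


A = [[1, 2, 1], [2, 2, 3], [-1, -3, 0]]  # coefficient matrix of (8.1.4)
L, U = lu_factor(A)
det_A = 1
for i in range(len(A)):
    det_A *= U[i][i]  # corollary: det(A) is the product of the pivots
```

Here `matmul(L, U)` reproduces A entry by entry, which is the statement of Theorem 8.1 for this matrix.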


Operation count To analyze the number of operations necessary to solve 
Ax = b using Gaussian elimination, we will consider separately the creation of L 
and U from A, the modification of b to g, and finally the solution of x. 


1. Calculation of L and U. At step 1, n − 1 divisions were used to calculate the
multipliers $m_{i1}$, $2 \le i \le n$. Then $(n-1)^2$ multiplications and $(n-1)^2$
additions were used to create the new elements $a_{ij}^{(2)}$. We can continue in this
way for each step. The results are summarized in Table 8.1. The total value
for each column was obtained using the identities

$$\sum_{j=1}^{p} j = \frac{p(p+1)}{2} \qquad \sum_{j=1}^{p} j^2 = \frac{p(p+1)(2p+1)}{6} \qquad p \ge 1$$

Table 8.1 Operation count for LU decomposition of a matrix

Step k    Additions            Multiplications      Divisions
1         (n − 1)^2            (n − 1)^2            n − 1
2         (n − 2)^2            (n − 2)^2            n − 2
...       ...                  ...                  ...
n − 1     1                    1                    1
Total     n(n − 1)(2n − 1)/6   n(n − 1)(2n − 1)/6   n(n − 1)/2


Traditionally, it is the number of multiplications and divisions, counted
together, that is used as the operation count for Gaussian elimination. On
earlier computers, additions were much faster than multiplications and
divisions, and thus additions were ignored in calculating the cost of many
algorithms. However, on modern computers, the times for additions, multiplications,
and divisions are quite close in size. For a convenient notation, let
MD(·) and AS(·) denote the number of multiplications and divisions, and
the number of additions and subtractions, respectively, for the computation
of the quantity in the parentheses.

For the LU decomposition of A, we have

$$MD(LU) = \frac{n(n^2-1)}{3} \approx \frac{n^3}{3} \qquad AS(LU) = \frac{n(n-1)(2n-1)}{6} \approx \frac{n^3}{3} \tag{8.1.6}$$

The final estimates are valid for larger values of n.
2. Modification of b to $g = b^{(n)}$:

$$MD(g) = (n-1) + (n-2) + \cdots + 1 = \frac{n(n-1)}{2} \qquad AS(g) = \frac{n(n-1)}{2} \tag{8.1.7}$$

3. Solution of Ux = g:

$$MD(x) = \frac{n(n+1)}{2} \qquad AS(x) = \frac{n(n-1)}{2} \tag{8.1.8}$$
4. Solution of Ax = b. Combine (1) through (3) to get

$$MD(LU, x) = \frac{n^3}{3} + n^2 - \frac{n}{3} \approx \frac{n^3}{3} \qquad AS(LU, x) = \frac{n(n-1)(2n+5)}{6} \approx \frac{n^3}{3} \tag{8.1.9}$$

The number of additions is always about the same as the number of
multiplications and divisions, and thus from here on, we consider only the
latter. The first thing to note is that solving Ax = b is comparatively cheap
when compared to such a supposedly simple operation as multiplying two
n × n matrices. The matrix multiplication requires $n^3$ operations, and the
solution of Ax = b requires only about $\frac{1}{3}n^3$ operations.
Second, the main cost of solving Ax = b is in producing the decomposition
A = LU. Once it has been found, only $n^2$ additional operations are
necessary to solve Ax = b. After once solving Ax = b, it is comparatively
cheap to solve additional systems with the same coefficient matrix, provided
the LU decomposition has been saved.

Finally, Gaussian elimination is much cheaper than Cramer's rule, which
uses determinants and is often taught in linear algebra courses [for example,
see Anton (1984), sec. 2.4]. If the determinants in Cramer's rule are computed
using expansion by minors, then the operation count is (n + 1)!. For
n = 10, Gaussian elimination uses 430 operations, and Cramer's rule uses
39,916,800 operations. This should emphasize the point that Cramer's rule is
not a practical computational tool, and that it should be considered as just a
theoretical mathematics tool.
5. Inversion of A. The inverse $A^{-1}$ is generally not needed, but it can be
produced by using Gaussian elimination. Finding $A^{-1}$ is equivalent to
solving the equation AX = I, with X an n × n unknown matrix. If we write
X and I in terms of their columns,

$$X = [x^{(1)}, \ldots, x^{(n)}] \qquad I = [e^{(1)}, \ldots, e^{(n)}]$$

then solving AX = I is equivalent to solving the n systems

$$Ax^{(1)} = e^{(1)}, \ldots, Ax^{(n)} = e^{(n)} \tag{8.1.10}$$

all having the same coefficient matrix A. Using (1)-(3),

$$MD(A^{-1}) = \frac{4n^3}{3} - \frac{n}{3} \approx \frac{4n^3}{3}$$

Calculating $A^{-1}$ is four times the expense of solving Ax = b for a single
vector b, not n times the work as one might first imagine. By careful
attention to the details of the inversion process, taking advantage of the
special form of the right-hand vectors $e^{(1)}, \ldots, e^{(n)}$, it is possible to further
reduce the operation count to exactly

$$MD(A^{-1}) = n^3 \tag{8.1.11}$$

However, it is still wasteful in most situations to produce $A^{-1}$ to solve
Ax = b. And there is no advantage in saving $A^{-1}$ rather than the LU
decomposition to solve future systems Ax = b. In both cases, the number of
multiplications and divisions necessary to solve Ax = b is exactly $n^2$.
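The counts in this section can be checked by walking through the loops of the algorithm and tallying one unit for each multiplication or division. The sketch below does this for n = 10. The function name `count_md` is hypothetical, and the tally follows the loop structure of Section 8.1 rather than any particular program.

```python
from math import factorial


def count_md(n):
    """Count multiplications and divisions by walking the loops of Section 8.1."""
    md_lu = md_g = md_x = 0
    for k in range(1, n):              # elimination steps k = 1, ..., n-1
        for i in range(k + 1, n + 1):
            md_lu += 1                 # one division for the multiplier m_ik
            md_lu += n - k             # multiplications updating row i of A
            md_g += 1                  # one multiplication updating b_i
    for k in range(n, 0, -1):          # back substitution
        md_x += (n - k) + 1            # n-k multiplications and one division
    return md_lu, md_g, md_x


n = 10
md_lu, md_g, md_x = count_md(n)
total = md_lu + md_g + md_x            # compare with the MD count in (8.1.9)
inverse = md_lu + n * (md_g + md_x)    # n right-hand sides, as in (8.1.10)
cramer = factorial(n + 1)              # the count quoted for Cramer's rule
```

For n = 10 the tallies are 330, 45, and 55, which sum to the 430 operations quoted in the text, and the inversion count agrees with $\frac{4}{3}n^3 - \frac{1}{3}n = 1330$.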


8.2. Pivoting and Scaling in Gaussian Elimination 


At each stage of the elimination process in the last section, we assumed the
appropriate pivot element $a_{kk}^{(k)} \ne 0$. To remove this assumption, begin each step
of the elimination process by switching rows to put a nonzero element in the 
pivot position. If none such exists, then the matrix must be singular, contrary to 
assumption. 

It is not enough, however, to just ask that the pivot element be nonzero. Often 
an element would be zero except for rounding errors that have occurred in 
calculating it. Using such an element as the pivot element will result in gross 
errors in the further calculations in the matrix. To guard against this, and for 
other reasons involving the propagation of rounding errors, we introduce partial 
pivoting and complete pivoting. 


Definition 1. Partial Pivoting. For $1 \le k \le n - 1$, in the Gaussian elimination
process at stage k, let

$$c_k = \max_{k \le i \le n} |a_{ik}^{(k)}| \tag{8.2.1}$$

Let i be the smallest row index, $i \ge k$, for which the maximum $c_k$ is
attained. If i > k, then switch rows k and i in A and b, and proceed
with step k of the elimination process. All of the multipliers will now
satisfy

$$|m_{ik}| \le 1 \qquad i = k+1, \ldots, n \tag{8.2.2}$$

This aids in preventing the growth of elements in $A^{(k)}$ of greatly
varying size, and thus lessens the possibility for large loss of significance
errors.

2. Complete Pivoting. Define

$$c_k = \max_{k \le i, j \le n} |a_{ij}^{(k)}|$$

Switch rows of A and b and columns of A to bring to the pivot
position an element giving the maximum $c_k$. Note that with a column
switch, the order of the unknowns is changed. At the completion of
the elimination and back substitution process, this must be reversed.
Complete pivoting has been proved to cause the roundoff error in Gaussian
elimination to propagate at a reasonably slow speed, compared with what can
happen when no pivoting is used. The theoretical results on the use of partial
pivoting are not quite as good, but in virtually all practical problems, the error
behavior is essentially the same as that for complete pivoting. Comparing
operation times, complete pivoting is the more expensive strategy, and thus,
partial pivoting is used in most practical algorithms. Henceforth we always mean
partial pivoting when we use the word pivoting. The entire question of roundoff
error propagation in Gaussian elimination has been analyzed very thoroughly by 
J. H. Wilkinson [e.g., see Wilkinson (1965), pp. 209-220], and some of his results 
are presented in Section 8.4. 


Example We illustrate the effect of using pivoting by solving the system

$$\begin{aligned} .729x + .81y + .9z &= .6867 \\ x + y + z &= .8338 \\ 1.331x + 1.21y + 1.1z &= 1.000 \end{aligned} \tag{8.2.3}$$

The exact solution, rounded to four significant digits, is

$$x = .2245 \qquad y = .2814 \qquad z = .3279 \tag{8.2.4}$$


Floating-point decimal arithmetic, with four digits in the mantissa, will be 
used to solve the linear system. The reason for using this arithmetic is to show the 
effect of working with only a finite number of digits, while keeping the presenta- 
tion manageable in size. The augmented matrix notation will be used to represent 
the system (8.2.3), just as was done with the earlier example (8.1.4). 


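1. Solution without pivoting.

$$\left[\begin{array}{rrr|r} .7290 & .8100 & .9000 & .6867 \\ 1.000 & 1.000 & 1.000 & .8338 \\ 1.331 & 1.210 & 1.100 & 1.000 \end{array}\right]$$

$$\downarrow \quad m_{21} = 1.372, \quad m_{31} = 1.826$$

$$\left[\begin{array}{rrr|r} .7290 & .8100 & .9000 & .6867 \\ 0.0 & -.1110 & -.2350 & -.1084 \\ 0.0 & -.2690 & -.5430 & -.2540 \end{array}\right]$$

$$\downarrow \quad m_{32} = 2.423$$

$$\left[\begin{array}{rrr|r} .7290 & .8100 & .9000 & .6867 \\ 0.0 & -.1110 & -.2350 & -.1084 \\ 0.0 & 0.0 & .02640 & .008700 \end{array}\right]$$

The solution is

$$x = .2251 \qquad y = .2790 \qquad z = .3295 \tag{8.2.5}$$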


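2. Solution with pivoting. To indicate the interchange of rows i and j, we will
use the notation $r_i \leftrightarrow r_j$.

$$\left[\begin{array}{rrr|r} .7290 & .8100 & .9000 & .6867 \\ 1.000 & 1.000 & 1.000 & .8338 \\ 1.331 & 1.210 & 1.100 & 1.000 \end{array}\right]$$

$$\downarrow \quad r_1 \leftrightarrow r_3, \quad m_{21} = .7513, \quad m_{31} = .5477$$

$$\left[\begin{array}{rrr|r} 1.331 & 1.210 & 1.100 & 1.000 \\ 0.0 & .09090 & .1736 & .08250 \\ 0.0 & .1473 & .2975 & .1390 \end{array}\right]$$

$$\downarrow \quad r_2 \leftrightarrow r_3, \quad m_{32} = .6171$$

$$\left[\begin{array}{rrr|r} 1.331 & 1.210 & 1.100 & 1.000 \\ 0.0 & .1473 & .2975 & .1390 \\ 0.0 & 0.0 & -.01000 & -.003280 \end{array}\right]$$

The solution is

$$x = .2246 \qquad y = .2812 \qquad z = .3280 \tag{8.2.6}$$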


The error in (8.2.5) is from seven to sixteen times larger than it is for (8.2.6),
depending on the component of the solution being considered. The results in
(8.2.6) have one more significant digit than do those of (8.2.5). This illustrates the 
positive effect that the use of pivoting can have on the error behavior for 
Gaussian elimination. 
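The four-digit computations of this example can be imitated with Python's `decimal` module by setting the working precision to four significant digits, so that every product, difference, and quotient is rounded as it is formed. The sketch below is one such imitation, not the procedure used to produce the tables; the function name `solve_four_digit` is hypothetical, and the order in which the operations are rounded is an assumption of this sketch.

```python
from decimal import Decimal, localcontext


def solve_four_digit(rows, rhs, pivot):
    """Eliminate and back-substitute with every operation rounded to 4 digits."""
    with localcontext() as ctx:
        ctx.prec = 4  # four significant decimal digits in the mantissa
        a = [[Decimal(v) for v in row] for row in rows]
        b = [Decimal(v) for v in rhs]
        n = len(a)
        for k in range(n - 1):
            if pivot:  # partial pivoting: first row i >= k with largest |a_ik|
                p = max(range(k, n), key=lambda i: abs(a[i][k]))
                a[k], a[p] = a[p], a[k]
                b[k], b[p] = b[p], b[k]
            for i in range(k + 1, n):
                m = a[i][k] / a[k][k]          # rounded multiplier
                a[i][k] = Decimal(0)
                for j in range(k + 1, n):
                    a[i][j] = a[i][j] - m * a[k][j]
                b[i] = b[i] - m * b[k]
        x = [Decimal(0)] * n
        for i in range(n - 1, -1, -1):         # back substitution
            s = b[i]
            for j in range(i + 1, n):
                s = s - a[i][j] * x[j]
            x[i] = s / a[i][i]
        return x


# The system (8.2.3); strings keep the decimal data exact.
A = [[".729", ".81", ".9"], ["1", "1", "1"], ["1.331", "1.21", "1.1"]]
b = [".6867", ".8338", "1.000"]
x_plain = solve_four_digit(A, b, pivot=False)
x_pivot = solve_four_digit(A, b, pivot=True)
```

Under these assumptions, `x_plain` and `x_pivot` agree with the values in (8.2.5) and (8.2.6), respectively.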


Pivoting changes the factorization result (8.1.5) given in Theorem 8.1. The 
result is still true, but in a modified form. If the row interchanges induced by 
pivoting were carried out on A before beginning elimination, then pivoting would 
be unnecessary. Row interchanges on A can be represented by premultiplication 
of A by an appropriate permutation matrix P, to get PA. Then Gaussian 
elimination on PA leads to 


LU = PA (8.2.7) 


where U is the upper triangular matrix obtained in the elimination process with 
pivoting. The lower triangular matrix L can be constructed using the multipliers 
from Gaussian elimination with pivoting. We omit the details, as the actual 
construction is unimportant. 
Example From the preceding example with pivoting, we form

$$L = \begin{bmatrix} 1.000 & 0.0 & 0.0 \\ .5477 & 1.000 & 0.0 \\ .7513 & .6171 & 1.000 \end{bmatrix} \qquad U = \begin{bmatrix} 1.331 & 1.210 & 1.100 \\ 0.0 & .1473 & .2975 \\ 0.0 & 0.0 & -.01000 \end{bmatrix}$$

When multiplied,

$$LU = \begin{bmatrix} 1.331 & 1.210 & 1.100 \\ .7289 & .8100 & .9000 \\ 1.000 & 1.000 & 1.000 \end{bmatrix} = PA \qquad P = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$$

The result PA is the matrix A with first, rows 1 and 3 interchanged, and then
rows 2 and 3 interchanged. This illustrates (8.2.7).
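The relation (8.2.7) for this example can be checked numerically. The sketch below builds P by applying the two recorded row interchanges to the identity matrix, then compares LU with PA in ordinary double precision. Because L and U were computed with four-digit arithmetic, the two products agree only to about four digits; the helper names are hypothetical.

```python
A = [[0.729, 0.81, 0.9], [1.0, 1.0, 1.0], [1.331, 1.21, 1.1]]
L = [[1.0, 0.0, 0.0], [0.5477, 1.0, 0.0], [0.7513, 0.6171, 1.0]]
U = [[1.331, 1.21, 1.1], [0.0, 0.1473, 0.2975], [0.0, 0.0, -0.01]]

# Row interchanges in the order they occurred (0-based): rows 1<->3, then 2<->3.
swaps = [(0, 2), (1, 2)]


def permutation_matrix(n, swaps):
    """Apply the recorded row interchanges, in order, to the identity matrix."""
    p = [[float(i == j) for j in range(n)] for i in range(n)]
    for r, s in swaps:
        p[r], p[s] = p[s], p[r]
    return p


def matmul(x, y):
    """Plain triple-loop matrix product for small lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


P = permutation_matrix(3, swaps)
PA = matmul(P, A)
LU = matmul(L, U)
max_diff = max(abs(LU[i][j] - PA[i][j]) for i in range(3) for j in range(3))
```

The matrix `P` produced this way is the permutation matrix displayed in the example, and `max_diff` is small compared with the four-digit working precision.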


Scaling It has been observed empirically that if the elements of the coefficient 
matrix A vary greatly in size, then it is likely that large loss of significance errors 
will be introduced and the propagation of rounding errors will be worse. To 
avoid this problem, we usually scale the matrix A so that the elements vary less. 
This is usually done by multiplying the rows and columns by suitable constants. 
The subject of scaling is not well understood currently, especially how to 
guarantee that the effect of rounding errors in Gaussian elimination will be made 
smaller by such scaling. Computational experience suggests that often all rows 
should be scaled to make them approximately equal in magnitude. In addition, 
all columns can be scaled to make them of about equal magnitude. The latter is 
equivalent to scaling the unknown components x; of x, and it can often be 
interpreted to say that the x; should be measured in units of comparable size. 

There is no known a priori strategy for picking the scaling factors so as to 
always decrease the effect of rounding error propagation, based solely on a 
knowledge of A and b. Stewart (1977) is somewhat critical of the general use of 
scaling as described in the preceding paragraph. He suggests choosing scaling 
factors so as to obtain a rescaled matrix in which the errors in the coefficients are 
of about equal magnitude. When rounding is the only source of error, this leads 
to the strategy of scaling to make all elements of about equal size. The 
LINPACK programs do not include scaling, but they recommend a strategy 
along the lines indicated by Stewart [see Dongarra et al. (1979), pp. I7-I12 for a 
more extensive discussion; for other discussions of scaling, see Forsythe and 
Moler (1967), chap. 11 and Golub and Van Loan (1983), pp. 72-74]. 

If we let B denote the result of row and column scaling in A, then

$$B = D_1 A D_2$$

where $D_1$ and $D_2$ are diagonal matrices, with entries the scaling constants. To
solve Ax = b, observe that

$$D_1 A D_2 (D_2^{-1} x) = D_1 b$$

Thus we solve for x by solving

$$Bz = D_1 b \qquad x = D_2 z \tag{8.2.8}$$
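The algebra of (8.2.8) can be sketched as follows. The small system and the scaling factors `d1` and `d2` are hypothetical, chosen only for illustration, and exact `Fraction` arithmetic is used so that the scaled and unscaled solutions can be compared exactly. In exact arithmetic scaling has no effect on the answer; its purpose in practice is to influence the rounding errors.

```python
from fractions import Fraction


def solve(a, b):
    """Gaussian elimination with partial pivoting on exact Fractions."""
    n = len(a)
    a = [[Fraction(v) for v in row] for row in a]
    b = [Fraction(v) for v in b]
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def solve_scaled(a, b, d1, d2):
    """Solve Ax = b through B = D1 A D2, B z = D1 b, x = D2 z, as in (8.2.8)."""
    n = len(a)
    bmat = [[d1[i] * a[i][j] * d2[j] for j in range(n)] for i in range(n)]
    z = solve(bmat, [d1[i] * b[i] for i in range(n)])
    return [d2[j] * z[j] for j in range(n)]


A = [[2, 4000], [3, 1]]                       # rows of very different size
b = [4002, 4]
d1 = [Fraction(1, 4000), Fraction(1, 3)]      # hypothetical row scaling factors
d2 = [Fraction(1), Fraction(1, 1000)]         # hypothetical column scaling factors
x_direct = solve(A, b)
x_scaled = solve_scaled(A, b, d1, d2)
```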
The remaining discussion is restricted to row scaling, since some form of it is
fairly widely agreed upon.
Usually we attempt to choose the coefficients so as to have

$$\max_{1 \le j \le n} |b_{ij}| \approx 1 \qquad i = 1, \ldots, n \tag{8.2.9}$$

where $B = [b_{ij}]$ is the result of scaling A. The most straightforward approach is
to define

$$s_i = \max_{1 \le j \le n} |a_{ij}| \qquad i = 1, \ldots, n$$
$$b_{ij} = \frac{a_{ij}}{s_i} \qquad j = 1, \ldots, n \tag{8.2.10}$$


But because this introduces an additional rounding error into each element of the 
coefficient matrix, two other techniques are more widely used. 


1. Scaling using computer number base. Let β denote the base used in the
computer arithmetic, for example, β = 2 on binary machines. Let $r_i$ be the
smallest integer for which $\beta^{r_i} \ge s_i$. Define the scaled matrix B by

$$b_{ij} = \frac{a_{ij}}{\beta^{r_i}} \qquad i, j = 1, \ldots, n \tag{8.2.11}$$

No rounding is involved in defining $b_{ij}$, only a change in the exponent in the
floating-point form for $a_{ij}$. The values of B satisfy

$$\beta^{-1} < \max_{1 \le j \le n} |b_{ij}| \le 1$$

and thus (8.2.9) is satisfied fairly well.


2. Implicit scaling. The use of scaling will generally change the choice of pivot
elements when pivoting is used with Gaussian elimination. And it is only with
such a change of pivot elements that the results in Gaussian elimination will be
changed. This is due to a result of F. Bauer [in Forsythe and Moler (1967), p. 38],
which states that if the scaling (8.2.11) is used, and if the choice of pivot elements
is forced to remain the same as when solving Ax = b, then the solution of (8.2.8)
will yield exactly the same computed value for x. Thus the only significance of
scaling is in the choice of the pivot elements.

For implicit scaling, we continue to use the matrix A. But we choose the pivot
element in step k of the Gaussian elimination algorithm by defining

$$c_k = \max_{k \le i \le n} \frac{|a_{ik}^{(k)}|}{s_i^{(k)}} \tag{8.2.12}$$



replacing the definition (8.2.1) used in defining partial pivoting. Choose the 
smallest index $i \ge k$ that yields $c_k$ in (8.2.12), and if $i \ne k$, then interchange
rows i and k. Also interchange $s_i^{(k)}$ and $s_k^{(k)}$, and denote the resulting new
values by $s_j^{(k+1)}$, j = 1,...,n, most of which have not changed. Then proceed
with the elimination algorithm of Section 8.1, as before. This form of scaling
seems to be the form most commonly used in current published algorithms, if
scaling is being used.
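The first technique, scaling by a power of the number base as in (8.2.11), can be sketched for binary floating-point arithmetic (β = 2). The sketch relies on the standard library functions `math.frexp` and `math.ldexp`, which split off and adjust the binary exponent, so the scaled entries are obtained without rounding (barring overflow or underflow). The function name and the sample matrix are hypothetical.

```python
import math


def scale_rows_by_power_of_two(a):
    """Row-scale a by exact powers of 2, following (8.2.11) with beta = 2."""
    scaled, exponents = [], []
    for row in a:
        s = max(abs(v) for v in row)         # s_i of (8.2.10); assumed nonzero
        mantissa, e = math.frexp(s)          # s = mantissa * 2**e, 0.5 <= mantissa < 1
        r = e - 1 if mantissa == 0.5 else e  # smallest integer r with 2**r >= s
        exponents.append(r)
        # Dividing by 2**r changes only the exponent of each entry.
        scaled.append([math.ldexp(v, -r) for v in row])
    return scaled, exponents


B, r = scale_rows_by_power_of_two([[3.0, -5.0, 1.5], [4.0, 1.0, 0.25]])
```

For the sample matrix the exponents are 3 and 2, and the largest entry in each scaled row lies in the interval (1/2, 1], as the bound following (8.2.11) requires.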


An algorithm for Gaussian elimination We first give an algorithm, called Factor,
for the triangular factorization of a matrix A. It uses Gaussian elimination with
partial pivoting, combined with the implicit scaling of (8.2.10) and (8.2.12). We
then give a second algorithm, called Solve, for using the results of Factor to solve
a linear system Ax = b. The reason for separating the elimination procedure into
these two steps is that we will often want to solve several systems Ax = b, with
the same A, but different values for b.


Algorithm Factor (A, n, Pivot, det, ier)

1. Remarks: A is an n × n matrix, to be factored using the LU
decomposition. Gaussian elimination is used, with partial pivoting
and implicit scaling in the rows. Upon completion of the
algorithm, the upper triangular matrix U will be stored in the
upper triangular part of A; and the multipliers of (8.1.1), which
make up L below its diagonal, will be stored in the corresponding
positions of A. The vector Pivot will contain a record of all
row interchanges. If Pivot(k) = k, then no interchange was used
in step k of the elimination process. But if Pivot(k) = i ≠ k,
then rows i and k were interchanged in step k of the elimination
process. The variable det will contain det(A) on exit.

The variable ier is an error indicator. If ier = 0, then the
routine was completed satisfactorily. But for ier = 1, the matrix
A was singular, in the sense that all possible pivot elements were
zero at some step of the elimination process. In this case, all
computation ceased, and the routine was exited. No attempt is
made to check on the accuracy of the computed decomposition
of A, and it can be nearly singular without being detected.

2. det := 1

3. $s_i := \max_{1 \le j \le n} |a_{ij}|$, i = 1,...,n.

4. Do through step 16 for k = 1,...,n − 1.

5. $c_k := \max_{k \le i \le n} |a_{ik} / s_i|$

6. Let $i_0$ be the smallest index $i \ge k$ for which the maximum in
step 5 is attained. Pivot(k) := $i_0$.

7. If $c_k = 0$, then ier := 1, det := 0, and exit from the algorithm.

8. If $i_0 = k$, then go to step 11.

9. det := −det

10. Interchange $a_{kj}$ and $a_{i_0 j}$, j = k,...,n. Interchange $s_k$ and $s_{i_0}$.

11. Do through step 14 for i = k + 1,...,n.

12. $a_{ik} := m_{ik} := a_{ik} / a_{kk}$

13. $a_{ij} := a_{ij} - m_{ik} a_{kj}$, j = k + 1,...,n

14. End loop on i.

15. det := $a_{kk}$ · det

16. End loop on k.

17. det := $a_{nn}$ · det; ier := 0 and exit the algorithm.


Algorithm Solve (A, n, b, Pivot)

1. Remarks: This algorithm will solve the linear system Ax = b.
It is assumed that the original matrix A has been factored
using the algorithm Factor, with the row interchanges recorded
in Pivot. The solution will be stored in b on exit. The matrix A
and vector Pivot are left unchanged.

2. Do through step 5 for k = 1, 2,...,n − 1.

3. If i := Pivot(k) ≠ k, then interchange $b_i$ and $b_k$.

4. $b_i := b_i - a_{ik} b_k$, i = k + 1,...,n

5. End loop on k.

6. $b_n := b_n / a_{nn}$

7. Do through step 9 for i = n − 1,...,1.

8. $b_i := \frac{1}{a_{ii}} \left( b_i - \sum_{j=i+1}^{n} a_{ij} b_j \right)$

9. End loop on i.

10. Exit from algorithm.
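The two algorithms can be transcribed into Python almost step for step. The sketch below is one such transcription, with 0-based indices and lists of lists in place of arrays. The early exit for a row of zeros is an addition made here, since the scale factor $s_i$ would otherwise be zero, and it is not part of the algorithm as stated. The data are those of the system (8.1.4).

```python
def factor(a):
    """In-place LU factorization following the steps of algorithm Factor.

    Returns (pivot, det, ier). Indices are 0-based, unlike the text.
    """
    n = len(a)
    det = 1.0                                            # step 2
    pivot = list(range(n))
    s = [max(abs(v) for v in row) for row in a]          # step 3
    if min(s) == 0.0:                                    # added guard: zero row
        return pivot, 0.0, 1
    for k in range(n - 1):                               # step 4
        ratios = [abs(a[i][k]) / s[i] for i in range(k, n)]
        c = max(ratios)                                  # step 5
        i0 = k + ratios.index(c)                         # step 6: smallest such i
        pivot[k] = i0
        if c == 0.0:                                     # step 7
            return pivot, 0.0, 1
        if i0 != k:                                      # steps 8-10
            det = -det
            for j in range(k, n):
                a[k][j], a[i0][j] = a[i0][j], a[k][j]
            s[k], s[i0] = s[i0], s[k]
        for i in range(k + 1, n):                        # steps 11-14
            a[i][k] = a[i][k] / a[k][k]                  # multiplier stored in place
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
        det *= a[k][k]                                   # step 15
    det *= a[n - 1][n - 1]                               # step 17
    return pivot, det, 0


def solve(a, b, pivot):
    """Overwrite b with the solution of Ax = b, following algorithm Solve."""
    n = len(a)
    for k in range(n - 1):                               # steps 2-5
        i = pivot[k]
        if i != k:
            b[i], b[k] = b[k], b[i]
        for r in range(k + 1, n):
            b[r] -= a[r][k] * b[k]
    b[n - 1] /= a[n - 1][n - 1]                          # step 6
    for i in range(n - 2, -1, -1):                       # steps 7-9
        b[i] = (b[i] - sum(a[i][j] * b[j] for j in range(i + 1, n))) / a[i][i]
    return b


A = [[1.0, 2.0, 1.0], [2.0, 2.0, 3.0], [-1.0, -3.0, 0.0]]   # system (8.1.4)
pivot, det, ier = factor(A)
x = solve(A, [0.0, 3.0, 2.0], pivot)
```

Once `factor` has been called, `solve` can be applied repeatedly with different right-hand sides at a cost of about $n^2$ operations each.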


The earlier example (8.2.3) will serve as an illustration. The use of implicit
scaling in this case will not require a change in the choice of pivot elements with
partial pivoting. Algorithms similar to Factor and Solve are given in a number of 
references [see Forsythe and Moler (1967), chaps. 16 and 17 and Dongarra et al. 
(1979), chap. 1, for improved versions of these algorithms]. The programs in 
LINPACK will also compute information concerning the condition or stability of 
the problem Ax = b and the accuracy of the computed solution. 

An important aspect of the LINPACK programs is the use of Basic Linear 
Algebra Subroutines (BLAS). These perform simple operations on vectors, such as 
forming the dot product of two vectors or adding a scalar multiple of one vector 
to another vector. The programs in LINPACK use these BLAS to replace many 
of the inner loops in a method. The BLAS can be optimized, if desired, for each 
computer; thus, the performance of the main LINPACK programs can also be 
easily improved while keeping the main source code machine-independent. For a 
more complete discussion of BLAS, see Lawson et al. (1979). 


8.3. Variants of Gaussian Elimination


There are many variants of Gaussian elimination. Some are modifications or 
simplifications, based on the special properties of some class of matrices, for 
example, symmetric, positive definite matrices. Other variants are ways to rewrite 
Gaussian elimination in a more compact form, sometimes in order to use special 
techniques to reduce the error. We consider only a few such variants, and later
make reference to others.


Gauss-Jordan method This procedure is much the same as regular elimination
including the possible use of pivoting and scaling. It differs in eliminating the
unknown in equations above the diagonal as well as below it. In step k of the
elimination algorithm choose the pivot element as before. Then define

$$a_{kj}^{(k+1)} = \frac{a_{kj}^{(k)}}{a_{kk}^{(k)}} \qquad j = k, \ldots, n$$
$$b_k^{(k+1)} = \frac{b_k^{(k)}}{a_{kk}^{(k)}}$$

Eliminate the unknown $x_k$ in equations both above and below equation k.
Define

$$a_{ij}^{(k+1)} = a_{ij}^{(k)} - a_{ik}^{(k)} a_{kj}^{(k+1)}$$
$$b_i^{(k+1)} = b_i^{(k)} - a_{ik}^{(k)} b_k^{(k+1)} \tag{8.3.1}$$

for j = k,...,n, i = 1,...,n, i ≠ k. The Gauss-Jordan method is equivalent to
the use of the reduced row-echelon form of linear algebra texts [for example, see
Anton (1984), pp. 8-9].
This procedure will convert the augmented matrix [A|b] to $[I \mid b^{(n+1)}]$, so that at
the completion of the preceding elimination, $x = b^{(n+1)}$. To solve Ax = b by this
technique requires

n(n—1) 


5 a (8.3.2) 


multiplications and divisions. This is 50 percent more than the regular elimination
method; consequently, the Gauss-Jordan method should usually not be used
for solving linear systems. However, it can be used to produce a matrix inversion
program that uses a minimum of storage. By taking advantage of the
special structure of the right side in AX = I, the Gauss-Jordan method can
produce the solution $X = A^{-1}$ using only n extra storage locations, rather than
the normal $n^2$ extra storage locations. Partial pivoting and implicit scaling can
still be used.
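A minimal sketch of the Gauss-Jordan procedure, with partial pivoting and exact `Fraction` arithmetic, is given below. It normalizes the pivot row and then eliminates the unknown from every other row, as in (8.3.1), so that no back substitution is needed. The function name is hypothetical, and the storage-saving inversion scheme mentioned in the text is not attempted here.

```python
from fractions import Fraction


def gauss_jordan(a, b):
    """Reduce [A|b] to [I|x] with partial pivoting; returns x."""
    n = len(a)
    a = [[Fraction(v) for v in row] for row in a]
    b = [Fraction(v) for v in b]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        pivot = a[k][k]
        a[k] = [v / pivot for v in a[k]]     # normalize row k
        b[k] = b[k] / pivot
        for i in range(n):                   # eliminate above and below row k
            if i != k:
                factor = a[i][k]
                a[i] = [a[i][j] - factor * a[k][j] for j in range(n)]
                b[i] = b[i] - factor * b[k]
    return b


x = gauss_jordan([[1, 2, 1], [2, 2, 3], [-1, -3, 0]], [0, 3, 2])  # system (8.1.4)
```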


Compact methods It is possible to move directly from a matrix A to its LU 
decomposition, and this can be combined with partial pivoting and scaling. If we 
disregard the possibility of pivoting for the moment, then the result 


A=LU (8.3.3) 


leads directly to a set of recursive formulas for the elements of L and U. 

There is some nonuniqueness in the choice of L and U, if we insist only that L 
and U be lower and upper triangular, respectively. If A is nonsingular, and if we 
have two decompositions 


$$A = L_1 U_1 = L_2 U_2 \tag{8.3.4}$$

then

$$L_2^{-1} L_1 = U_2 U_1^{-1} \tag{8.3.5}$$

The inverse and the products of lower triangular matrices are again lower
triangular, and similarly for upper triangular matrices. The left and right sides of
(8.3.5) are lower and upper triangular, respectively. Thus they must equal a
diagonal matrix, call it D, and

$$L_1 = L_2 D \qquad U_1 = D^{-1} U_2 \tag{8.3.6}$$

The choice of D is tied directly to the choice of the diagonal elements of either L
or U, and once they have been chosen, D is uniquely determined.

If the diagonal elements of L are all required to equal 1, then the resulting
decomposition A = LU is that given by Gaussian elimination, as in Section 8.1.
The associated compact method gives explicit formulas for $l_{ij}$ and $u_{ij}$, and it is
known as Doolittle's method. If we choose to have the diagonal elements of U all
equal 1, the associated compact method for calculating A = LU is called Crout's
method. There is only a multiplying diagonal matrix to distinguish it from
Doolittle's method. For an algorithm using Crout's algorithm for the factorization
(8.3.3), with partial pivoting and implicit scaling, see the program unsymdet
in Wilkinson and Reinsch (1971, pp. 93-110). In some situations, Crout's method
has advantages over the usual Doolittle method.

The principal advantage of the compact formula is that the elements $l_{ij}$ and
$u_{ij}$ all involve inner products, as illustrated below in formula (8.3.14)-(8.3.15) for
the factorization of a symmetric positive definite matrix. These inner products
can be accumulated using double precision arithmetic, possibly including a
concluding division, and then be rounded to single precision. This way of
computing inner products was discussed in Chapter 1 preceding the error
formula (1.5.19). This limited use of double precision can greatly increase the
accuracy of the factors L and U, and it is not possible to do this with the regular
elimination method unless all operations and storage are done in double precision
[for a complete discussion of these compact methods, see Wilkinson (1965),
pp. 221-228, and Golub and Van Loan (1983), sec. 5.1].
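The inner-product form of the compact method can be sketched as follows for Doolittle's method without pivoting. The recursions used are the standard ones obtained by equating elements in A = LU with a unit diagonal in L; they are written out here as an assumption, since the text does not display them. The function `math.fsum` is used only as a stand-in for accumulating each inner product in higher precision: it sums the already rounded products without further rounding error, which is not the same as full double precision accumulation. The second function rescales the Doolittle factors by D = diag(U), as in (8.3.6), to obtain factors of the Crout form.

```python
import math


def doolittle(a):
    """Compact LU factorization with unit diagonal in L (no pivoting)."""
    n = len(a)
    l = [[float(i == j) for j in range(n)] for i in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):       # row i of U from one inner product each
            u[i][j] = a[i][j] - math.fsum(l[i][k] * u[k][j] for k in range(i))
        for r in range(i + 1, n):   # column i of L, with a concluding division
            inner = math.fsum(l[r][k] * u[k][i] for k in range(i))
            l[r][i] = (a[r][i] - inner) / u[i][i]
    return l, u


def crout_from_doolittle(l, u):
    """Rescale by D = diag(U): L_c = L D and U_c = D^{-1} U, as in (8.3.6)."""
    n = len(l)
    d = [u[i][i] for i in range(n)]
    lc = [[l[i][j] * d[j] for j in range(n)] for i in range(n)]
    uc = [[u[i][j] / d[i] for j in range(n)] for i in range(n)]
    return lc, uc


L, U = doolittle([[1, 2, 1], [2, 2, 3], [-1, -3, 0]])  # matrix of (8.1.4)
Lc, Uc = crout_from_doolittle(L, U)
```

For the matrix of (8.1.4), the factors `L` and `U` agree with those found by elimination in Section 8.1, and `Uc` has a unit diagonal.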


The Cholesky method Let A be a symmetric and positive definite matrix of
order n. The matrix A is positive definite if

$$(Ax, x) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j > 0 \tag{8.3.7}$$

for all $x \in R^n$, $x \ne 0$. Some of the properties of positive definite matrices are
given in Problem 14 of Chapter 7 and Problems 9, 11, and 12 of this chapter.
Symmetric positive definite matrices occur in a wide variety of applications.

For such a matrix A, there is a very convenient factorization, and it can be
carried through without any need for pivoting or scaling. This is called Cholesky's
method, and it states that we can find a lower triangular real matrix L such that

$$A = LL^T \tag{8.3.8}$$

The method requires only $\frac{1}{2}n(n+1)$ storage locations for L, rather than the
usual $n^2$ locations, and the number of operations is about $\frac{1}{6}n^3$, rather than the
number $\frac{1}{3}n^3$ required for the usual decomposition.

To prove that (8.3.8) is possible, we give a derivation of L based on induction.
Assume the result is true for all positive definite symmetric matrices of order
$\le n - 1$. We show it is true for all such matrices A of order n. Write the desired
L, of order n, in the form

$$L = \begin{bmatrix} \hat{L} & 0 \\ \gamma^T & x \end{bmatrix}$$

with $\hat{L}$ a square matrix of order n - 1, $\gamma \in \mathbf{R}^{n-1}$, and x a scalar. The L is to be
chosen to satisfy $A = LL^T$:

$$\begin{bmatrix} \hat{L} & 0 \\ \gamma^T & x \end{bmatrix} \begin{bmatrix} \hat{L}^T & \gamma \\ 0 & x \end{bmatrix} = A = \begin{bmatrix} \hat{A} & c \\ c^T & d \end{bmatrix} \tag{8.3.9}$$

with $\hat{A}$ of order n - 1, $c \in \mathbf{R}^{n-1}$, and $d = a_{nn}$ real. Since (8.3.7) is true for A, let
$x_n = 0$ in it to obtain the analogous statement for $\hat{A}$, showing $\hat{A}$ is also positive
definite and symmetric. In addition, d > 0, by letting $x_1 = \cdots = x_{n-1} = 0$,
$x_n = 1$ in (8.3.7). Multiplying in (8.3.9), choose $\hat{L}$ by the induction hypothesis to
satisfy

$$\hat{L}\hat{L}^T = \hat{A} \tag{8.3.10}$$

Then choose $\gamma$ by solving

$$\hat{L}\gamma = c \tag{8.3.11}$$

since $\hat{L}$ is nonsingular, because $\det(\hat{A}) = [\det(\hat{L})]^2$. Finally, x must satisfy

$$\gamma^T\gamma + x^2 = d \tag{8.3.12}$$

To see that $x^2$ must be positive, form the determinant of both sides in (8.3.9),
obtaining

$$[\det(\hat{L})]^2 x^2 = \det(A) \tag{8.3.13}$$

Since det (A) is the product of the eigenvalues of A, and since all eigenvalues of
a positive definite symmetric matrix are positive (see Problem 14 of Chapter 7), det (A) is
positive. Also, by the induction hypothesis, $\hat{L}$ is real. Thus $x^2$ is positive in
(8.3.13), and we let x be its positive square root. Since the result (8.3.8) is trivially
true for matrices of order n = 1, this completes the proof of the factorization
(8.3.8). For another approach, see Golub and Van Loan (1983, sec. 5.2).

A practical construction of L can be based on (8.3.9)-(8.3.12), but we give one
based on directly finding the elements of L. Let $L = [l_{ij}]$, with $l_{ij} = 0$ for j > i.
Begin the construction of L by multiplying the first row of L times the first
column of $L^T$ to get

$$l_{11}^2 = a_{11}$$

Because A is positive definite, $a_{11} > 0$, and $l_{11} = \sqrt{a_{11}}$. Multiply the second row
of L times the first two columns of $L^T$ to get

$$l_{21} l_{11} = a_{21}, \qquad l_{21}^2 + l_{22}^2 = a_{22}$$

Again, we can solve for the unknowns $l_{21}$ and $l_{22}$. In general for i = 1, 2, ..., n,

$$l_{ij} = \frac{a_{ij} - \sum_{k=1}^{j-1} l_{ik} l_{jk}}{l_{jj}}, \qquad j = 1, \ldots, i - 1 \tag{8.3.14}$$

$$l_{ii} = \left[ a_{ii} - \sum_{k=1}^{i-1} l_{ik}^2 \right]^{1/2} \tag{8.3.15}$$

The argument in this square root is the term $x^2$ in the earlier derivation (8.3.12),
and $l_{ii}$ is real and positive. For programs implementing Cholesky’s method, see
Dongarra et al. (1979, chap. 3) and Wilkinson and Reinsch (1971, pp. 10-30).
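
The row-by-row construction in (8.3.14)-(8.3.15) can be sketched in a few lines of Python. This is a minimal illustration only, not one of the library programs cited above; the function name `cholesky` and the list-of-lists storage are assumptions made for the example, and no double precision accumulation of the inner products is attempted.

```python
import math

def cholesky(A):
    """Return lower triangular L with A = L L^T, following (8.3.14)-(8.3.15).

    A is a symmetric positive definite matrix stored as a list of rows.
    (Hypothetical helper written for illustration.)
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Off-diagonal elements of row i, formula (8.3.14).
        for j in range(i):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
        # Diagonal element, formula (8.3.15).
        d = A[i][i] - sum(L[i][k] ** 2 for k in range(i))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite to working precision")
        L[i][i] = math.sqrt(d)
    return L

# Hilbert matrix of order three.
H3 = [[1.0 / (i + j + 1) for j in range(3)] for i in range(3)]
L = cholesky(H3)
```

For the Hilbert matrix of order three, the computed diagonal elements should agree with $1$, $1/(2\sqrt{3})$, and $1/(6\sqrt{5})$ up to rounding error.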

Note the inner products in (8.3.14) and (8.3.15). These can be accumulated in
double precision, minimizing the number of rounding errors, and the elements $l_{ij}$
will be in error by much less than if they had been calculated using only single
precision arithmetic. Also note that the elements of L remain bounded relative to
A, since (8.3.15) yields a bound for the elements of row i, using

$$l_{i1}^2 + \cdots + l_{ii}^2 = a_{ii} \tag{8.3.16}$$

The square roots in (8.3.15) of Choleski’s method can be avoided by using a
slight modification of (8.3.8). Find a diagonal matrix D and a lower triangular
matrix $\tilde{L}$, with 1s on the diagonal, such that

$$A = \tilde{L} D \tilde{L}^T \tag{8.3.17}$$

This factorization can be done with about the same number of operations as
Choleski’s method, about $\frac{1}{6}n^3$, with no square roots. For further discussion and a
program, see Wilkinson and Reinsch (1971, pp. 10-30).
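
As a hedged sketch of the square-root-free factorization (8.3.17), the following Python code uses the standard column-by-column recurrences for the unit lower triangular factor and the diagonal of D. These recurrences are not written out in the text, so they are an assumption of this example, as are the function name `ldlt` and the use of exact rational arithmetic from the standard `fractions` module.

```python
from fractions import Fraction

def ldlt(A):
    """Return (L, d) with A = L diag(d) L^T and unit diagonal in L.

    Assumed recurrences (not given explicitly in the text):
      d_j  = a_jj - sum_{k<j} l_jk^2 d_k
      l_ij = (a_ij - sum_{k<j} l_ik l_jk d_k) / d_j,  i > j
    """
    n = len(A)
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    d = [Fraction(0)] * n
    for j in range(n):
        d[j] = A[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] * d[k] for k in range(j))
            L[i][j] = (A[i][j] - s) / d[j]
    return L, d

# Hilbert matrix of order three with exact rational entries.
H3 = [[Fraction(1, i + j + 1) for j in range(3)] for i in range(3)]
L, d = ldlt(H3)
```

With exact fractions there is no rounding, so the factors for the Hilbert matrix of order three can be compared directly with hand computation: the diagonal is 1, 1/12, 1/180.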


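Example Consider the Hilbert matrix of order three,

$$A = \begin{bmatrix} 1 & \frac{1}{2} & \frac{1}{3} \\ \frac{1}{2} & \frac{1}{3} & \frac{1}{4} \\ \frac{1}{3} & \frac{1}{4} & \frac{1}{5} \end{bmatrix} \tag{8.3.18}$$

For the Choleski decomposition,

$$L = \begin{bmatrix} 1 & 0 & 0 \\ \frac{1}{2} & \frac{1}{2\sqrt{3}} & 0 \\ \frac{1}{3} & \frac{1}{2\sqrt{3}} & \frac{1}{6\sqrt{5}} \end{bmatrix}$$

and for (8.3.17),

$$\tilde{L} = \begin{bmatrix} 1 & 0 & 0 \\ \frac{1}{2} & 1 & 0 \\ \frac{1}{3} & 1 & 1 \end{bmatrix} \qquad D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{12} & 0 \\ 0 & 0 & \frac{1}{180} \end{bmatrix}$$

For many linear systems in applications, the coefficient matrix A is banded,
which means

$$a_{ij} = 0 \quad \text{if } |i - j| > m \tag{8.3.19}$$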


for some small m > 0. The preceding algorithms simplify in this case, with a 
considerable savings in computation time. For such algorithms when A is 
symmetric and positive definite, see the LINPACK programs in Dongarra et al. 
(1979, chap. 4). We next describe an algorithm in the case m = 1 in (8.3.19). 
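
Tridiagonal systems The matrix $A = [a_{ij}]$ is tridiagonal if

$$a_{ij} = 0 \quad \text{for } |i - j| > 1 \tag{8.3.20}$$

This gives the form

$$A = \begin{bmatrix} a_1 & c_1 & 0 & \cdots & 0 \\ b_2 & a_2 & c_2 & & \vdots \\ 0 & b_3 & a_3 & \ddots & \\ \vdots & & \ddots & \ddots & c_{n-1} \\ 0 & \cdots & & b_n & a_n \end{bmatrix} \tag{8.3.21}$$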


Tridiagonal matrices occur in a variety of applications. Recall the linear system 
(3.7.22) for spline functions in Section 3.7 of Chapter 3. In addition, many 
numerical methods for solving boundary value problems for ordinary and partial 
differential equations involve the solution of tridiagonal systems. Virtually all of 
these applications yield tridiagonal matrices for which the LU factorization can 
be formed without pivoting, and for which there is no large increase in error as a
consequence. The precise assumptions on A are given below in Theorem 8.2.

By considering the factorization A = LU without pivoting, we find that most
elements of L and U will be zero. And we are led to the following general
formula for the decomposition:


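$$A = LU = \begin{bmatrix} \alpha_1 & 0 & & \cdots & 0 \\ b_2 & \alpha_2 & 0 & & \vdots \\ 0 & b_3 & \alpha_3 & \ddots & \\ \vdots & & \ddots & \ddots & 0 \\ 0 & \cdots & & b_n & \alpha_n \end{bmatrix} \begin{bmatrix} 1 & \gamma_1 & 0 & \cdots & 0 \\ 0 & 1 & \gamma_2 & & \vdots \\ \vdots & & \ddots & \ddots & \\ & & & 1 & \gamma_{n-1} \\ 0 & \cdots & & 0 & 1 \end{bmatrix}$$

We can multiply to obtain a way to recursively compute $\{\alpha_i\}$ and $\{\gamma_i\}$:

$$\begin{aligned} a_1 &= \alpha_1, \qquad \alpha_1\gamma_1 = c_1 \\ a_i &= \alpha_i + b_i\gamma_{i-1}, \qquad i = 2, \ldots, n \\ \alpha_i\gamma_i &= c_i, \qquad i = 2, 3, \ldots, n - 1 \end{aligned} \tag{8.3.22}$$

These can be solved to give

$$\begin{aligned} \alpha_1 &= a_1, \qquad \gamma_1 = \frac{c_1}{\alpha_1} \\ \alpha_i &= a_i - b_i\gamma_{i-1}, \qquad \gamma_i = \frac{c_i}{\alpha_i}, \qquad i = 2, 3, \ldots, n - 1 \\ \alpha_n &= a_n - b_n\gamma_{n-1} \end{aligned} \tag{8.3.23}$$

To solve LUx = f, let Ux = z and Lz = f. Then

$$\begin{aligned} z_1 &= \frac{f_1}{\alpha_1}, \qquad z_i = \frac{f_i - b_i z_{i-1}}{\alpha_i}, \qquad i = 2, 3, \ldots, n \\ x_n &= z_n, \qquad x_i = z_i - \gamma_i x_{i+1}, \qquad i = n - 1, n - 2, \ldots, 1 \end{aligned} \tag{8.3.24}$$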


The constants in (8.3.23) can be stored for later use in solving the linear system 
Ax = f, for as many right sides f as desired. 
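
The factorization (8.3.23) and the two sweeps (8.3.24) translate directly into code. The sketch below is illustrative only: the function names and the zero-based storage convention (three lists `a`, `b`, `c` of length n, with `b[0]` and `c[n-1]` unused) are assumptions of this example, and no check of the hypotheses on A is made.

```python
def tridiag_factor(a, b, c):
    """Compute alpha_i and gamma_i of (8.3.23).

    a: diagonal, b: subdiagonal (b[0] unused), c: superdiagonal (c[-1] unused).
    """
    n = len(a)
    alpha = [0.0] * n
    gamma = [0.0] * (n - 1)
    alpha[0] = a[0]
    for i in range(n - 1):
        gamma[i] = c[i] / alpha[i]
        alpha[i + 1] = a[i + 1] - b[i + 1] * gamma[i]
    return alpha, gamma

def tridiag_solve(alpha, gamma, b, f):
    """Solve L z = f, then U x = z, as in (8.3.24)."""
    n = len(alpha)
    z = [0.0] * n
    z[0] = f[0] / alpha[0]
    for i in range(1, n):
        z[i] = (f[i] - b[i] * z[i - 1]) / alpha[i]
    x = z[:]
    for i in range(n - 2, -1, -1):
        x[i] = z[i] - gamma[i] * x[i + 1]
    return x

# A 4 x 4 tridiagonal matrix with diagonal (2, 4, 4, 2) and off-diagonals 1.
a = [2.0, 4.0, 4.0, 2.0]
b = [0.0, 1.0, 1.0, 1.0]
c = [1.0, 1.0, 1.0, 0.0]
alpha, gamma = tridiag_factor(a, b, c)
# Right side chosen so that the exact solution is (1, 1, 1, 1).
x = tridiag_solve(alpha, gamma, b, [3.0, 6.0, 6.0, 3.0])
```

The stored `alpha` and `gamma` can be reused for further right sides by calling `tridiag_solve` again, which is the point of separating the two steps.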

Counting only multiplications and divisions, the number of operations to
calculate L and U is 2n - 2; to solve Ax = f takes an additional 3n - 2
operations. Thus we need only 5n - 4 operations to solve Ax = f the first time,
and for each additional right side, with the same A, we need only 3n - 2
operations. This is extremely rapid. To illustrate this, note that $A^{-1}$ generally is
dense and has mostly nonzero entries; thus, the calculation of $x = A^{-1}f$ will
require $n^2$ operations. In many applications n may be larger than 1000, and thus
there is a significant savings in using (8.3.23)-(8.3.24) as compared with other
methods of solution.

To justify the preceding decomposition of A, especially to show that all the
coefficients $\alpha_i \ne 0$, we have the following theorem.


Theorem 8.2 Assume the coefficients $\{a_i, b_i, c_i\}$ of (8.3.21) satisfy the following
conditions:

1. $|a_1| > |c_1| > 0$
2. $|a_i| \ge |b_i| + |c_i|$, $b_i c_i \ne 0$, $i = 2, \ldots, n - 1$
3. $|a_n| > |b_n| > 0$

Then A is nonsingular,

$$|\gamma_i| < 1, \qquad i = 1, \ldots, n - 1$$

$$|a_i| - |b_i| < |\alpha_i| < |a_i| + |b_i|, \qquad i = 2, \ldots, n$$

Proof For the proof, see Isaacson and Keller (1966, p. 57). Note that the last
bound shows $|\alpha_i| > |c_i|$, $i = 2, \ldots, n - 1$. Thus the coefficients of L
and U remain bounded, and no divisors are used that are almost zero
except for rounding error.

The condition that $b_i c_i \ne 0$ is not essential. For example, if some
$b_i = 0$, then the linear system can be broken into two new systems, one
of order i - 1 and the other of order n - i + 1. For example, if

$$A = \begin{bmatrix} a_1 & c_1 & 0 & 0 \\ b_2 & a_2 & c_2 & 0 \\ 0 & 0 & a_3 & c_3 \\ 0 & 0 & b_4 & a_4 \end{bmatrix}$$

then solve Ax = f by reducing it to the following two linear systems,

$$\begin{bmatrix} a_3 & c_3 \\ b_4 & a_4 \end{bmatrix} \begin{bmatrix} x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} f_3 \\ f_4 \end{bmatrix} \qquad \begin{bmatrix} a_1 & c_1 \\ b_2 & a_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} f_1 \\ f_2 - c_2 x_3 \end{bmatrix}$$

This completes the proof. ∎


Example Consider the coefficient matrix for spline interpolation, in (3.7.22) of
Chapter 3. Consider $h_i$ = constant in that matrix, and then factor h/6 from
every row. Restricting our interest to the matrix of order four, the resulting
matrix is

$$A = \begin{bmatrix} 2 & 1 & 0 & 0 \\ 1 & 4 & 1 & 0 \\ 0 & 1 & 4 & 1 \\ 0 & 0 & 1 & 2 \end{bmatrix}$$

Using the method (8.3.23), this has the LU factorization

$$L = \begin{bmatrix} 2 & 0 & 0 & 0 \\ 1 & \frac{7}{2} & 0 & 0 \\ 0 & 1 & \frac{26}{7} & 0 \\ 0 & 0 & 1 & \frac{45}{26} \end{bmatrix} \qquad U = \begin{bmatrix} 1 & \frac{1}{2} & 0 & 0 \\ 0 & 1 & \frac{2}{7} & 0 \\ 0 & 0 & 1 & \frac{7}{26} \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

This completes the example. And it should indicate that the solution of the cubic
spline interpolation problem, described in Section 3.7 of Chapter 3, is not
difficult to compute.


8.4 Error Analysis 
We begin the error analysis of methods for solving Ax = b by examining the
stability of the solution x relative to small perturbations in the right side b. We
will follow the general schemata of Section 1.6 of Chapter 1, and in particular, we
will study the condition number of (1.6.6).
Let Ax = b, of order n, be uniquely solvable, and consider the solution of the
perturbed problem

$$A\tilde{x} = b + r \tag{8.4.1}$$

Let $e = \tilde{x} - x$, and subtract Ax = b to get

$$Ae = r, \qquad e = A^{-1}r \tag{8.4.2}$$
To examine the stability of Ax = b as in (1.6.6), we want to bound the quantity

$$\frac{\|e\| / \|x\|}{\|r\| / \|b\|} \tag{8.4.3}$$

as r ranges over all elements of $\mathbf{R}^n$, which are small relative to b.
From (8.4.2), take norms to obtain

$$\|r\| \le \|A\| \|e\|, \qquad \|e\| \le \|A^{-1}\| \|r\|$$

Divide by $\|A\| \|x\|$ in the first inequality and by $\|x\|$ in the second one to obtain

$$\frac{\|r\|}{\|A\| \|x\|} \le \frac{\|e\|}{\|x\|} \le \frac{\|A^{-1}\| \|r\|}{\|x\|}$$

The matrix norm is the operator matrix norm induced by the vector norm. Using
the bounds

$$\|b\| \le \|A\| \|x\|, \qquad \|x\| \le \|A^{-1}\| \|b\|$$

we obtain

$$\frac{1}{\|A\| \|A^{-1}\|} \cdot \frac{\|r\|}{\|b\|} \le \frac{\|e\|}{\|x\|} \le \|A\| \|A^{-1}\| \cdot \frac{\|r\|}{\|b\|} \tag{8.4.4}$$

Recalling (8.4.3), this result is justification for introducing the condition number
of A:

$$\text{cond}(A) = \|A\| \|A^{-1}\| \tag{8.4.5}$$


For each given A, there are choices of b and r for which either of the inequalities 
in (8.4.4) can be made an equality. This is a further reason for introducing 
cond (A) when considering (8.4.3). We leave the proof to Problem 20. 

The quantity cond (A) will vary with the norm being used, but it is always
bounded below by one, since

$$1 \le \|I\| = \|AA^{-1}\| \le \|A\| \|A^{-1}\| = \text{cond}(A)$$

If the condition number is nearly 1, then we see from (8.4.4) that small relative
perturbations in b will lead to similarly small relative perturbations in the
solution x. But if cond (A) is large, then (8.4.4) suggests that there may be small
relative perturbations of b that will lead to large relative perturbations in x.

Because (8.4.5) will vary with the choice of norm, we sometimes use another
definition of condition number, one independent of the norm. From Theorem 7.8
of Chapter 7,

$$\text{cond}(A) \ge r_\sigma(A)\, r_\sigma(A^{-1})$$

Since the eigenvalues of $A^{-1}$ are the reciprocals of those of A, we have the result

$$\text{cond}(A) \ge \frac{\max_{\lambda \in \sigma(A)} |\lambda|}{\min_{\lambda \in \sigma(A)} |\lambda|} \equiv \text{cond}(A)_* \tag{8.4.6}$$

in which $\sigma(A)$ denotes the set of all eigenvalues of A.

Example Consider the linear system

$$\begin{aligned} 7x_1 + 10x_2 &= b_1 \\ 5x_1 + 7x_2 &= b_2 \end{aligned} \tag{8.4.7}$$

For the coefficient matrix,

$$A = \begin{bmatrix} 7 & 10 \\ 5 & 7 \end{bmatrix} \qquad A^{-1} = \begin{bmatrix} -7 & 10 \\ 5 & -7 \end{bmatrix}$$

Let the condition number in (8.4.5) be denoted by $\text{cond}(A)_p$ when it is generated
using the matrix norm $\| \cdot \|_p$. For this example,

$$\text{cond}(A)_1 = \text{cond}(A)_\infty = (17)(17) = 289$$
$$\text{cond}(A)_2 \doteq 223 \qquad \text{cond}(A)_* \doteq 198$$

These condition numbers all suggest that (8.4.7) may be sensitive to changes in
the right side b. To illustrate this possibility, consider the particular case

$$\begin{aligned} 7x_1 + 10x_2 &= 1 \\ 5x_1 + 7x_2 &= .7 \end{aligned}$$

which has the solution

$$x_1 = 0 \qquad x_2 = .1$$

For the perturbed system, solve

$$\begin{aligned} 7\tilde{x}_1 + 10\tilde{x}_2 &= 1.01 \\ 5\tilde{x}_1 + 7\tilde{x}_2 &= .69 \end{aligned}$$

It has the solution

$$\tilde{x}_1 = -.17 \qquad \tilde{x}_2 = .22$$

The relative changes in x are quite large when compared with the size of the
relative changes in the right side b.
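
The numbers in this example are easy to check by machine. The short Python sketch below recomputes the condition numbers in the 1-norm and the infinity norm for the matrix of (8.4.7) and compares the relative change in the solution with the relative change in the right side. The helper names are hypothetical and the explicit 2 x 2 inverse is used only because the matrix is so small.

```python
def inverse_2x2(A):
    """Explicit inverse of a 2 x 2 matrix (illustration only)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def norm_inf(A):
    """Row norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)

def norm_1(A):
    """Column norm: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

A = [[7.0, 10.0], [5.0, 7.0]]
A_inv = inverse_2x2(A)
cond_inf = norm_inf(A) * norm_inf(A_inv)
cond_1 = norm_1(A) * norm_1(A_inv)

b = [1.0, 0.7]
b_pert = [1.01, 0.69]
x = matvec(A_inv, b)            # approximately (0, .1)
x_pert = matvec(A_inv, b_pert)  # approximately (-.17, .22)

# Relative changes in the infinity norm, as in (8.4.3).
rel_b = max(abs(p - q) for p, q in zip(b_pert, b)) / max(abs(q) for q in b)
rel_x = max(abs(p - q) for p, q in zip(x_pert, x)) / max(abs(q) for q in x)
amplification = rel_x / rel_b
```

A relative change of about .01 in b produces a relative change of about 1.7 in x, an amplification of roughly 170, which is consistent with the upper bound in (8.4.4) for a condition number of 289.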


A linear system whose solution x is unstable with respect to small relative 
changes in the right side b is called ill-conditioned. The preceding system (8.4.7) is 
somewhat ill-conditioned, especially if only three or four decimal digit floating-point
arithmetic is used in solving it. The condition numbers $\text{cond}(A)_p$ and
$\text{cond}(A)_*$ are fairly good indicators of ill-conditioning. As they increase by a
factor of 10, it is likely that one less digit of accuracy will be obtained in the 
solution. 

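In general, if $\text{cond}(A)_*$ is large, then there will be values of b for which the
system Ax = b is quite sensitive to changes r in b. Let $\lambda_l$ and $\lambda_u$ denote
eigenvalues of A for which

$$|\lambda_l| = \min_{\lambda \in \sigma(A)} |\lambda| \qquad |\lambda_u| = \max_{\lambda \in \sigma(A)} |\lambda|$$

and thus

$$\text{cond}(A)_* = \left| \frac{\lambda_u}{\lambda_l} \right| \tag{8.4.8}$$

Let $x_l$ and $x_u$ be corresponding eigenvectors, with $\|x_l\|_\infty = \|x_u\|_\infty = 1$. Then

$$Ax = \lambda_u x_u$$

has the solution $x = x_u$. And the system

$$A\tilde{x} = \lambda_u x_u + \lambda_l x_l = \lambda_u \left[ x_u + \frac{\lambda_l}{\lambda_u} x_l \right]$$

has the solution

$$\tilde{x} = x_u + x_l$$

If $\text{cond}(A)_*$ is large, then the right-hand side has only a small relative perturbation,

$$\frac{\|r\|_\infty}{\|b\|_\infty} = \left| \frac{\lambda_l}{\lambda_u} \right| = \frac{1}{\text{cond}(A)_*}$$

But for the solution, we have the much larger relative perturbation

$$\frac{\|\tilde{x} - x\|_\infty}{\|x\|_\infty} = \frac{\|x_l\|_\infty}{\|x_u\|_\infty} = 1 \tag{8.4.10}$$

There are systems that are not ill-conditioned in actual practice, but for which
the preceding condition numbers are quite large. For example,

$$A = \begin{bmatrix} 1 & 0 \\ 0 & 10^{-10} \end{bmatrix}$$

has all condition numbers $\text{cond}(A)_p$ and $\text{cond}(A)_*$ equal to $10^{10}$. But usually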
the matrix is not considered ill-conditioned. The difficulty is in using norms to 
measure changes in a vector, rather than looking at each component separately. If 
scaling has been carried out on the coefficient matrix and unknown vector, then 
this problem does not usually arise, and then the condition numbers are usually 
an accurate predictor of ill-conditioning. 

As a final justification for the use of cond (A) as a condition number, we give 
the following result. 


Theorem 8.3 (Gastinel) Let A be any nonsingular matrix of order n, and let
$\| \cdot \|$ denote an operator matrix norm. Then

$$\frac{1}{\text{cond}(A)} = \min \left\{ \frac{\|A - B\|}{\|A\|} \;\middle|\; B \text{ a singular matrix} \right\} \tag{8.4.11}$$

with cond (A) defined in (8.4.5).
Proof See Kahan (1966, p. 775). ∎


The theorem states that A can be well approximated in a relative error sense
by a singular matrix B if and only if cond (A) is quite large. And from our view,
a singular matrix B is the ultimate in ill-conditioning. There are nonzero
perturbations of the solution, by the eigenvector for the eigenvalue $\lambda = 0$, which
correspond to a zero perturbation in the right side b. More importantly, there are
values of b for which Bx = b is no longer solvable.


The Hilbert matrix The Hilbert matrix of order n is defined by

$$H_n = \begin{bmatrix} 1 & \frac{1}{2} & \cdots & \frac{1}{n} \\ \frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{n+1} \\ \vdots & & & \vdots \\ \frac{1}{n} & \frac{1}{n+1} & \cdots & \frac{1}{2n-1} \end{bmatrix} \tag{8.4.12}$$

This matrix occurs naturally in solving the continuous least squares approximation
problem. Its derivation is given near the end of Section 4.3 of Chapter 4,
with the resulting linear system given in (4.3.14). As was indicated in Section 4.3
and illustrated following (1.6.9) in Section 1.6 of Chapter 1, the Hilbert matrix is
very ill-conditioned, and increasingly so as n increases. As such, it has been a
favorite numerical example for checking programs for solving linear systems of
equations, to determine the limits of effectiveness of the program when dealing
with ill-conditioned problems. Table 8.2 gives the condition number $\text{cond}(H_n)_*$
for a few values of n.

Table 8.2. Condition numbers of Hilbert matrix

n     cond(H_n)_*       n     cond(H_n)_*
3     5.24E + 2         7     4.75E + 8
4     1.55E + 4         8     1.53E + 10
5     4.77E + 5         9     4.93E + 11
6     1.50E + 7         10    1.60E + 13

The inverse matrix $H_n^{-1} = [\alpha_{ij}^{(n)}]$ is known explicitly:

$$\alpha_{ij}^{(n)} = \frac{(-1)^{i+j} (n + i - 1)!\,(n + j - 1)!}{(i + j - 1) \left[ (i - 1)!\,(j - 1)! \right]^2 (n - i)!\,(n - j)!}, \qquad 1 \le i, j \le n \tag{8.4.13}$$

For additional information on $H_n$, including an asymptotic formula for
$\text{cond}(H_n)_*$, see Gregory and Karney (1969, pp. 33-38, 66-73).
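
The explicit inverse (8.4.13) can be checked with exact rational arithmetic. The following Python sketch is an illustration only; the function names are hypothetical, and the standard `fractions` module is used so that the product of the Hilbert matrix with its claimed inverse can be tested for equality with the identity without any rounding.

```python
from fractions import Fraction
from math import factorial

def hilbert(n):
    """Hilbert matrix of order n with exact rational entries."""
    return [[Fraction(1, i + j - 1) for j in range(1, n + 1)]
            for i in range(1, n + 1)]

def hilbert_inverse(n):
    """Inverse of the Hilbert matrix of order n from the closed form (8.4.13)."""
    def entry(i, j):
        num = (-1) ** (i + j) * factorial(n + i - 1) * factorial(n + j - 1)
        den = ((i + j - 1)
               * (factorial(i - 1) * factorial(j - 1)) ** 2
               * factorial(n - i) * factorial(n - j))
        return Fraction(num, den)
    return [[entry(i, j) for j in range(1, n + 1)] for i in range(1, n + 1)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

H3_inv = hilbert_inverse(3)
product = matmul(hilbert(5), hilbert_inverse(5))
```

For n = 3 the formula gives the integer matrix with rows (9, -36, 30), (-36, 192, -180), (30, -180, 180). The size of these entries, compared with entries of at most 1 in the Hilbert matrix itself, is another view of its ill-conditioning.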

Although widely used as an example, some care must be taken as to what is
the true answer. Let $\hat{H}_n$ denote the version of $H_n$ after it is entered into the finite
arithmetic of a computer. For a matrix inversion program, the results of the
program should be compared with $\hat{H}_n^{-1}$, not with $H_n^{-1}$; these two inverse
matrices can be quite different. For example, if we use four decimal digit
floating-point arithmetic with rounding, then

$$\hat{H}_3 = \begin{bmatrix} 1.000 & .5000 & .3333 \\ .5000 & .3333 & .2500 \\ .3333 & .2500 & .2000 \end{bmatrix} \tag{8.4.14}$$

Rounding has occurred only in expanding $\frac{1}{3}$ in decimal fraction form. Then

$$H_3^{-1} = \begin{bmatrix} 9.000 & -36.00 & 30.00 \\ -36.00 & 192.0 & -180.0 \\ 30.00 & -180.0 & 180.0 \end{bmatrix}$$

$$\hat{H}_3^{-1} = \begin{bmatrix} 9.062 & -36.32 & 30.30 \\ -36.32 & 193.7 & -181.6 \\ 30.30 & -181.6 & 181.5 \end{bmatrix} \tag{8.4.15}$$

Any program for matrix inversion, when applied to $\hat{H}_3$, should have its resulting
solution compared with $\hat{H}_3^{-1}$, not $H_3^{-1}$. We return to this example later in
Section 8.5.

Error bounds We consider the effects of rounding error on the solution $\hat{x}$ to
Ax = b, obtained using Gaussian elimination. We begin by giving a result
bounding the error when b and A are changed by small amounts. This is a useful
result by itself, and it is necessary for the error analysis of Gaussian elimination
that follows later.


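Theorem 8.4 Consider the system Ax = b, with A nonsingular. Let $\delta A$ and $\delta b$
be perturbations of A and b, and assume

$$\|\delta A\| < \frac{1}{\|A^{-1}\|} \tag{8.4.16}$$

Then $A + \delta A$ is nonsingular. And if we define $\delta x$ implicitly by

$$(A + \delta A)(x + \delta x) = b + \delta b \tag{8.4.17}$$

then

$$\frac{\|\delta x\|}{\|x\|} \le \frac{\text{cond}(A)}{1 - \text{cond}(A) \dfrac{\|\delta A\|}{\|A\|}} \cdot \left\{ \frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|} \right\} \tag{8.4.18}$$

Proof First note that $\delta A$ represents any matrix satisfying (8.4.16), not a
constant $\delta$ times the matrix A, and similarly for $\delta b$ and $\delta x$. Using
(8.4.16), the nonsingularity of $A + \delta A$ follows immediately from Theorem
7.12 of Chapter 7. From (7.4.11),

$$\|(A + \delta A)^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \|\delta A\|} \tag{8.4.19}$$

Solving for $\delta x$ in (8.4.17), and using Ax = b,

$$(A + \delta A)\,\delta x + Ax + (\delta A)x = b + \delta b$$
$$\delta x = (A + \delta A)^{-1} \left[ \delta b - (\delta A)x \right]$$

Using (8.4.19) and the definition (8.4.5) of cond (A),

$$\|\delta x\| \le \frac{\text{cond}(A)}{1 - \text{cond}(A) \dfrac{\|\delta A\|}{\|A\|}} \cdot \left\{ \frac{\|\delta b\|}{\|A\|} + \|x\| \frac{\|\delta A\|}{\|A\|} \right\}$$

Divide by $\|x\|$ on both sides, and use $\|b\| \le \|A\| \|x\|$ to obtain (8.4.18).
∎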

The analysis of the effect of rounding errors on Gaussian elimination is due to
J. H. Wilkinson, and it can be found in Wilkinson (1963, pp. 94-99), (1965, pp.
209-216), Forsythe and Moler (1967, chap. 21), and Golub and Van Loan (1983,
chap. 4). Let $\hat{x}$ denote the computed solution of Ax = b. It is very difficult to
compute directly the effects on $\hat{x}$ of rounding at each step, as a means of
obtaining a bound on $\|x - \hat{x}\|$. Rather, it is easier, although nontrivial, to take $\hat{x}$
and the elimination algorithm and to work backwards to show that $\hat{x}$ is the exact
solution of a system

$$(A + \delta A)\hat{x} = b$$

in which bounds can be given for $\delta A$. This approach is known as backward error
analysis. We can then use the preceding Theorem 8.4 to bound $\|x - \hat{x}\|$. In the
following result, the matrix norm will be $\|A\|_\infty$, the row norm (7.3.17) induced by
the vector norm $\| \cdot \|_\infty$.


Theorem 8.5 Let A be of order n and nonsingular, and assume partial or
complete pivoting is used in the Gaussian elimination process.
Define

$$\rho = \frac{1}{\|A\|_\infty} \max_{i,j,k} \left| a_{ij}^{(k)} \right| \tag{8.4.20}$$

Let u denote the unit round on the computer being used. [See
(1.2.11)-(1.2.12) for the definition of u.]

1. The matrices L and U computed using Gaussian elimination
satisfy

$$LU = A + E, \qquad \|E\|_\infty \le n^2 \rho \|A\|_\infty u \tag{8.4.21}$$

2. The approximate solution $\hat{x}$ of Ax = b, computed using
Gaussian elimination, satisfies

$$(A + \delta A)\hat{x} = b \tag{8.4.22}$$

with

$$\frac{\|\delta A\|_\infty}{\|A\|_\infty} \le 1.01 (n^3 + 3n^2) \rho u \tag{8.4.23}$$

3. Using Theorem 8.4,

$$\frac{\|x - \hat{x}\|_\infty}{\|x\|_\infty} \le \frac{\text{cond}(A)_\infty}{1 - \text{cond}(A)_\infty \dfrac{\|\delta A\|_\infty}{\|A\|_\infty}} \left[ 1.01 (n^3 + 3n^2) \rho u \right] \tag{8.4.24}$$

Proof The proofs of (1) and (2) are given in Forsythe and Moler (1967, chap.
21). Variations on these results are given in Golub and Van Loan (1983,
chap. 4).

Empirically, the bound (8.4.23) is too large, due to cancellation of rounding
errors of varying magnitude and sign. According to Wilkinson (1963, p. 108), a
better empirical bound for most cases is

$$\frac{\|\delta A\|_\infty}{\|A\|_\infty} \le nu \tag{8.4.25}$$

The result (8.4.24) shows the importance of the size of cond (A).
The quantity $\rho$ in the bounds can be computed during the elimination process,
and it can also be bounded a priori. For complete pivoting, an a priori bound is

$$\rho \le 1.8\, n^{(\ln n)/4}, \qquad n \ge 1$$

and it is conjectured that $\rho \le cn$ for some c. For partial pivoting, an a priori
bound is $2^{n-1}$, and pathological examples are known for which this is possible.
Nonetheless, in all empirical studies to date, p has been bounded by a relatively 
small number, independent of n. Because of the differing theoretical bounds for 
p, complete pivoting is sometimes considered superior. In actual practice, how- 
ever, the error behavior with partial pivoting is as good as with complete 
pivoting. Moreover, complete pivoting requires many more comparisons at each 
step of the elimination process. Consequently, partial pivoting is the approach 
used in all modern Gaussian elimination codes. 

One of the most important consequences of the preceding analysis is to show 
that Gaussian elimination is a very stable process, provided only that the matrix 
A is not badly ill-conditioned. Historically, researchers in the early 1950s were 
uncertain as to the stability of Gaussian elimination for larger systems, for 
example, n > 10, but that question has now been settled.

The size of the residual in the computed solution $\hat{x}$, namely

$$r = b - A\hat{x} \tag{8.4.26}$$

is sometimes linked, mistakenly, to the size of the error $x - \hat{x}$. In fact, the error
in $\hat{x}$ can be large even though r is small, and this is usually the case with
ill-conditioned problems. From (8.4.26) and Ax = b,

$$r = A(x - \hat{x}), \qquad x - \hat{x} = A^{-1}r \tag{8.4.27}$$

and thus $x - \hat{x}$ can be much larger than r if $A^{-1}$ has large elements.
In practice, the residual r is quite small, even for ill-conditioned problems. To
suggest why this should happen, use (8.4.22) to obtain

$$r = (\delta A)\hat{x}, \qquad \|r\|_\infty \le \|\delta A\|_\infty \|\hat{x}\|_\infty$$

$$\frac{\|r\|_\infty}{\|A\|_\infty \|\hat{x}\|_\infty} \le \frac{\|\delta A\|_\infty}{\|A\|_\infty} \tag{8.4.28}$$

The bounds for $\|\delta A\| / \|A\|$ in (8.4.23) or (8.4.25) are independent of the
conditioning of the problem. Thus $\|r\|_\infty$ will generally be small relative to
$\|A\|_\infty \|\hat{x}\|_\infty$. The latter is often close to $\|b\|$ or is of the same magnitude, since
b = Ax, and then $\|r\|$ will be small relative to $\|b\|$. As a final note on the size of
the residual, there are some problems in which it is important only to have r be
small, without $x - \hat{x}$ needing to be small. In such cases, ill-conditioning will not
have the same meaning.

The bounds (8.4.18) and (8.4.24) indicate the importance of cond (A) in
determining the error. Generally if $\text{cond}(A) \doteq 10^m$, some $m \ge 0$, then about m
digits of accuracy will be lost in computing $\hat{x}$, relative to the number of digits in
the arithmetic being used. Thus measuring $\text{cond}(A) = \|A\| \|A^{-1}\|$ is desirable.
The term $\|A\|$ is easy and inexpensive to evaluate, and $\|A^{-1}\|$ is the main
problem in computing cond (A). Calculating $A^{-1}$ requires $n^3$ operations, and
this is too expensive a way to compute $\|A^{-1}\|$. A less expensive approach, using
$O(n^2)$ operations, was developed for the LINPACK package.

For any system Ay = d,

$$y = A^{-1}d, \qquad \|y\| \le \|A^{-1}\| \|d\|$$

$$\|A^{-1}\| \ge \frac{\|y\|}{\|d\|} \tag{8.4.29}$$

We want to choose d to make this ratio as large as possible. Write A = LU, with
LU obtained in the Gaussian elimination. Then solving Ay = d is equivalent to
solving

$$Lw = d, \qquad Uy = w$$

While solving Lw = d, develop d to make w as large as possible, while retaining
$\|d\|_\infty = 1$. Then solve Uy = w for y. This will give a better bound in (8.4.29) than
a randomly chosen d. An algorithm for choosing d is given in Golub and
Van Loan (1983, p. 77). The algorithm in LINPACK is a more complicated
extension of the preceding. For a description see Golub and Van Loan (1983, p.
78) or Dongarra et al. (1979, pp. 1.12-1.13).
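
To make the idea concrete, here is a minimal Python sketch of one simple way to develop d during the forward substitution: each component of d is taken as +1 or -1, whichever makes the corresponding component of w larger in magnitude. This greedy rule is an assumption made for illustration; it is not claimed to be the algorithm of Golub and Van Loan or of LINPACK, and the function name is hypothetical. The factors L and U are assumed to be available already.

```python
def estimate_inverse_norm(L, U):
    """Lower bound for the infinity norm of the inverse of A = L U, via (8.4.29).

    Greedy choice (illustrative assumption): d_i = +1 or -1, whichever
    maximizes |w_i| in the forward substitution L w = d.
    """
    n = len(L)
    w = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * w[j] for j in range(i))
        d_i = -1.0 if s > 0 else 1.0      # makes |d_i - s| = 1 + |s|
        w[i] = (d_i - s) / L[i][i]
    # Back substitution U y = w.
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (w[i] - s) / U[i][i]
    # The infinity norm of d is 1, so the ratio in (8.4.29) is just that of y.
    return max(abs(v) for v in y)

# LU factors (no pivoting) of the matrix [[7, 10], [5, 7]].
L = [[1.0, 0.0], [5.0 / 7.0, 1.0]]
U = [[7.0, 10.0], [0.0, 7.0 - (5.0 / 7.0) * 10.0]]
estimate = estimate_inverse_norm(L, U)
```

For this small matrix the inverse has row norm 17, and the greedy choice of d happens to attain it; in general the result is only a lower bound for the norm of the inverse.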


A posteriori error bounds We begin with error bounds for a computed inverse
C of a given matrix A. Define the residual matrix by

$$R = I - CA$$

Theorem 8.6 If $\|R\| < 1$, then A and C are nonsingular, and

$$\frac{\|R\|}{\|A\| \|C\|} \le \frac{\|A^{-1} - C\|}{\|C\|} \le \frac{\|R\|}{1 - \|R\|} \tag{8.4.30}$$

Proof Since $\|R\| < 1$, I - R is nonsingular by Theorem 7.11 of Chapter 7, and

$$\|(I - R)^{-1}\| \le \frac{1}{1 - \|R\|}$$

But

$$I - R = CA \tag{8.4.31}$$

$$0 \ne \det(I - R) = \det(CA) = \det(C) \det(A)$$

and thus both det (C) and det (A) are nonzero. This shows that both A
and C are nonsingular.
For the lower bound in (8.4.30),

$$R = I - CA = (A^{-1} - C)A, \qquad \|R\| \le \|A^{-1} - C\| \|A\|$$

and dividing by $\|A\| \|C\|$ proves the result. For the upper bound, (8.4.31)
implies

$$(I - R)^{-1} = A^{-1}C^{-1}, \qquad A^{-1} = (I - R)^{-1}C \tag{8.4.32}$$

For the error in C,

$$A^{-1} - C = (I - CA)A^{-1} = RA^{-1} = R(I - R)^{-1}C$$

$$\|A^{-1} - C\| \le \frac{\|R\| \|C\|}{1 - \|R\|}$$

This completes the proof. ∎


This result is generally of more theoretical than practical interest. Inverse
matrices should not be produced for solving a linear system, as was pointed out 
earlier in Section 8.1. And as a consequence, there is seldom any real need for the 
preceding type of error bound. The main exception is when C has been produced 
as an approximation by means other than Gaussian elimination, often by some 
theoretical derivation. Such approximate inverses are then used to solve Ax = b 
by the residual correction procedure (8.5.3) described in the next section. In this 
case, the bound (8.4.30) can furnish some useful information on C. 


Corollary Let A, C, and R be as given in Theorem 8.6. Let $\hat{x}$ be an
approximate solution to Ax = b, and define $r = b - A\hat{x}$. Then

$$\|x - \hat{x}\| \le \frac{\|Cr\|}{1 - \|R\|} \tag{8.4.33}$$

Proof From

$$r = b - A\hat{x} = Ax - A\hat{x} = A(x - \hat{x})$$
$$x - \hat{x} = A^{-1}r = (I - R)^{-1}Cr \tag{8.4.34}$$

with (8.4.32) used in the last equality. Taking norms, we obtain (8.4.33).
∎


This bound (8.4.33) has been found to be quite accurate, especially when 
compared with a number of other bounds that are commonly used. For a 
complete discussion of computable error bounds, including a number of exam- 
ples, see Aird and Lynch (1975). 
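
As a small numerical illustration of (8.4.33), the Python sketch below forms the residual matrix R = I - CA for an approximate inverse C and evaluates the bound for an approximate solution. The particular matrices and vectors are illustrative assumptions (a slightly perturbed inverse of the 2 x 2 matrix of (8.4.7) and an arbitrary nearby solution), and the helper names are hypothetical.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def mat_norm_inf(A):
    return max(sum(abs(v) for v in row) for row in A)

def vec_norm_inf(x):
    return max(abs(v) for v in x)

A = [[7.0, 10.0], [5.0, 7.0]]
C = [[-7.01, 10.0], [5.0, -6.99]]      # assumed approximate inverse of A
b = [1.0, 0.7]
x_true = [0.0, 0.1]                    # exact solution of Ax = b
x_hat = [0.01, 0.09]                   # assumed approximate solution

CA = matmul(C, A)
R = [[(1.0 if i == j else 0.0) - CA[i][j] for j in range(2)] for i in range(2)]
norm_R = mat_norm_inf(R)               # must be less than 1 for Theorem 8.6

r = [bi - axi for bi, axi in zip(b, matvec(A, x_hat))]
bound = vec_norm_inf(matvec(C, r)) / (1.0 - norm_R)      # right side of (8.4.33)
error = vec_norm_inf([xt - xh for xt, xh in zip(x_true, x_hat)])
```

Here the norm of R is about .17, the true error is .01, and the bound is about .0124, so in this instance the bound overestimates the error only mildly.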

The error bound (8.4.33) is relatively expensive to produce. If we suppose that
$\hat{x}$ was obtained by Gaussian elimination, then about $n^3/3$ operations were used
to calculate $\hat{x}$ and the LU decomposition of A. To produce $C \doteq A^{-1}$ by
elimination will take at least $\frac{2}{3}n^3$ additional operations, producing CA requires $n^3$
multiplications, and producing Cr requires $n^2$. Thus the error bound requires at
least a fivefold increase in the number of operations. It is generally preferable to
estimate the error by solving approximately the error equation

$$A(x - \hat{x}) = r$$

using the LU decomposition stored earlier. This requires $n^2$ operations to
evaluate r, and an additional $n^2$ to solve the linear system. Unless the residual
matrix R = I - CA has norm nearly one, this approach will give a very reasonable
error estimate. This is pursued and illustrated in the next section.


8.5 The Residual Correction Method 
We assume that Ax = b has been solved for an approximate solution $\hat{x} \equiv x^{(0)}$.
Also the LU decomposition along with a record of all row or column interchanges
should have been stored. Calculate

$$r^{(0)} = b - Ax^{(0)} \tag{8.5.1}$$

Define $e^{(0)} = x - x^{(0)}$. Then as before in (8.4.34),

$$Ae^{(0)} = r^{(0)}$$

Solve this system using the stored LU decomposition, and call the resulting
approximate solution $\hat{e}^{(0)}$. Define a new approximate solution to Ax = b by

$$x^{(1)} = x^{(0)} + \hat{e}^{(0)} \tag{8.5.2}$$

The process can be repeated, calculating $x^{(2)}, \ldots$, to continually decrease the
error. To calculate $r^{(0)}$ takes $n^2$ operations, and the calculation of $\hat{e}^{(0)}$ takes an
additional $n^2$ operations. Thus the calculation of the improved values
$x^{(1)}, x^{(2)}, \ldots$, is inexpensive compared with the calculation of the original value
$x^{(0)}$. This method is also known as iterative improvement or the residual correction
method.

It is extremely important to obtain accurate values for $r^{(0)}$. Since $x^{(0)}$ approximately
solves Ax = b, $r^{(0)}$ will generally involve loss-of-significance errors
in its calculation, with $Ax^{(0)}$ and b agreeing to almost the full precision of the
machine arithmetic. Thus to obtain accurate values for $r^{(0)}$, we must usually go to
higher precision arithmetic. If only regular arithmetic is used to calculate $r^{(0)}$, the
same arithmetic as used in calculating $x^{(0)}$ and LU, then the resulting inaccuracy
in $r^{(0)}$ will usually lead to $\hat{e}^{(0)}$ being a poor approximation to $e^{(0)}$. In single
precision arithmetic, we calculate $r^{(0)}$ in double precision. But if the calculations
are already in double precision, it is often hard to go to a higher precision
arithmetic.


Example Solve the system Ax = b, with $A = \hat{H}_3$ from (8.4.14). The arithmetic
will be four decimal digit floating-point with rounding. For the right side, use

$$b = [1, 0, 0]^T$$

The true solution is the first column of $\hat{H}_3^{-1}$, which from (8.4.15) is

$$x = [9.062, -36.32, 30.30]^T$$

to four significant digits.
Using elimination with partial pivoting,

$$x^{(0)} = [8.968, -35.77, 29.77]^T$$

The residual $r^{(0)}$ is calculated with double precision arithmetic, and is then
rounded to four significant digits. The value obtained is

$$r^{(0)} = [-.005341, -.004359, -.0005344]^T$$

Solving $Ae^{(0)} = r^{(0)}$ with the stored LU decomposition,

$$\hat{e}^{(0)} = [.09216, -.5442, .5239]^T$$
$$x^{(1)} = [9.060, -36.31, 30.29]^T$$

Repeating these operations,

$$r^{(1)} = [-.0006570, -.0003770, -.0001980]^T$$
$$\hat{e}^{(1)} = [.001707, -.01300, .01241]^T$$
$$x^{(2)} = [9.062, -36.32, 30.30]^T$$

The vector $x^{(2)}$ is accurate to four significant digits. Also, note that $x^{(1)} - x^{(0)} = \hat{e}^{(0)}$
is an accurate predictor of the error $e^{(0)}$ in $x^{(0)}$.


Formulas can be developed to estimate how many iterates should be calcu- 
lated in order to get essentially full accuracy in the solution x. For a discussion of 
what is involved and for some algorithms implementing this method, see Dongarra 
et al. (1979, p. 1.9), Forsythe and Moler (1967, chaps. 13, 16, 17), Golub and 
Van Loan (1983, p. 74), and Wilkinson and Reinsch (1971, pp. 93-110). 
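
The following Python sketch imitates the flavor of this procedure. It is an illustration under stated assumptions, not a reproduction of the four-digit computation in the example: only the LU factorization is carried out in simulated four significant digit arithmetic (through the hypothetical helper `fl4`), while the residuals, the triangular solves, and the updates use ordinary double precision. The intermediate iterates will therefore differ from those printed above, but the limit should agree with the first column of the inverse in (8.4.15).

```python
def fl4(x):
    """Round x to four significant decimal digits (simulated short arithmetic)."""
    return float(f"{x:.3e}")

def lu_factor_4digit(A):
    """LU factorization with partial pivoting, every operation rounded by fl4."""
    n = len(A)
    LU = [[fl4(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            m = fl4(LU[i][k] / LU[k][k])
            LU[i][k] = m                      # store the multiplier in place
            for j in range(k + 1, n):
                LU[i][j] = fl4(LU[i][j] - fl4(m * LU[k][j]))
    return LU, perm

def lu_solve(LU, perm, b):
    """Solve with the stored factors (unit lower triangular L, upper triangular U)."""
    n = len(b)
    y = [b[p] for p in perm]
    for i in range(n):
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y

def iterative_improvement(A, b, iterations):
    """Residual correction (8.5.1)-(8.5.2) using a stored low-accuracy LU."""
    LU, perm = lu_factor_4digit(A)
    x = lu_solve(LU, perm, b)
    for _ in range(iterations):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))
             for row, bi in zip(A, b)]        # residual in double precision
        e = lu_solve(LU, perm, r)
        x = [xi + ei for xi, ei in zip(x, e)]
    return x

H3_hat = [[1.000, 0.5000, 0.3333],
          [0.5000, 0.3333, 0.2500],
          [0.3333, 0.2500, 0.2000]]
x = iterative_improvement(H3_hat, [1.0, 0.0, 0.0], 10)
```

The design choice here mirrors the remark in the text: a factorization of limited accuracy is acceptable, provided the residual is computed more accurately than the factorization itself.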


Another residual correction method There are situations in which we can
calculate an approximate inverse C to the given matrix A. This is generally done
by carefully considering the structure of A, and then using a variety of approximation
techniques to estimate $A^{-1}$. Without considering the origin of C, we show
how to use it to iteratively solve Ax = b.
Let $x^{(0)}$ be an initial guess, and define $r^{(0)} = b - Ax^{(0)}$. As before, $A(x - x^{(0)}) = r^{(0)}$.
Define $x^{(1)}$ implicitly by

$$x^{(1)} - x^{(0)} = Cr^{(0)}$$

In general, define

$$r^{(m)} = b - Ax^{(m)}, \qquad x^{(m+1)} = x^{(m)} + Cr^{(m)}, \qquad m = 0, 1, 2, \ldots \tag{8.5.3}$$

If C is a good approximation to $A^{-1}$, the iteration will converge rapidly, as
shown in the following analysis.
We first obtain a recursion formula for the error:

$$\begin{aligned} x - x^{(m+1)} &= x - x^{(m)} - Cr^{(m)} = x - x^{(m)} - C[b - Ax^{(m)}] \\ &= x - x^{(m)} - C[Ax - Ax^{(m)}] \end{aligned}$$

$$x - x^{(m+1)} = (I - CA)(x - x^{(m)}) \tag{8.5.4}$$

By induction

$$x - x^{(m)} = (I - CA)^m (x - x^{(0)}), \qquad m \ge 0 \tag{8.5.5}$$

If

$$\|I - CA\| < 1 \tag{8.5.6}$$

for some matrix norm, then using the associated vector norm,

$$\|x - x^{(m)}\| \le \|I - CA\|^m \|x - x^{(0)}\| \tag{8.5.7}$$

And this converges to zero as $m \to \infty$, for any choice of initial guess $x^{(0)}$. More
generally, (8.5.5) implies that $x^{(m)}$ converges to x, for any choice of $x^{(0)}$, if and
only if

$$(I - CA)^m \to 0 \quad \text{as } m \to \infty$$

And by Theorem 7.9 of Chapter 7, this is equivalent to

$$r_\sigma(I - CA) < 1 \tag{8.5.8}$$

for the spectral radius of I - CA. This may be possible to show, even when (8.5.6)
fails for the common matrix norms. Also note that

$$I - AC = A(I - CA)A^{-1}$$

and thus I - AC and I - CA are similar matrices and have the same eigenvalues.
If

$$\|I - AC\| < 1 \tag{8.5.9}$$

then (8.5.8) is true, even if (8.5.6) is not true, and convergence will still occur.
Statement (8.5.4) shows that the rate of convergence of $x^{(m)}$ to x is linear:

$$\|x - x^{(m+1)}\| \le c \|x - x^{(m)}\|, \qquad m \ge 0 \tag{8.5.10}$$

with c < 1 unknown. The constant c is often estimated computationally with

$$c \doteq \max \frac{\|x^{(m+2)} - x^{(m+1)}\|}{\|x^{(m+1)} - x^{(m)}\|} \tag{8.5.11}$$

with the maximum performed over some or all of the iterates that have been
computed. This is not rigorous, but is motivated by the formula

$$x^{(m+2)} - x^{(m+1)} = (I - CA)(x^{(m+1)} - x^{(m)}) \tag{8.5.12}$$

To prove this, simply use (8.5.4), subtracting formulas for successive values of m.
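If we assume (8.5.10) is valid for the iterates that we are calculating, and if we
have an estimate for c, then we can produce an error bound.

$$\begin{aligned} \|x^{(m+1)} - x^{(m)}\| &= \|[x - x^{(m)}] - [x - x^{(m+1)}]\| \\ &\ge \|x - x^{(m)}\| - \|x - x^{(m+1)}\| \\ &\ge \|x - x^{(m)}\| - c\|x - x^{(m)}\| \end{aligned}$$

$$\|x - x^{(m)}\| \le \frac{1}{1 - c} \|x^{(m+1)} - x^{(m)}\|$$

$$\|x - x^{(m+1)}\| \le \frac{c}{1 - c} \|x^{(m+1)} - x^{(m)}\| \tag{8.5.13}$$

For slowly convergent iterates [with $c \doteq 1$], this bound is important, since
$\|x^{(m+1)} - x^{(m)}\|$ can then be much smaller than $\|x - x^{(m+1)}\|$. Also, recall the
earlier derivation in Section 2.5 of Chapter 2. A similar bound, (2.5.5), was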
derived for the error in a linearly convergent method. 
Example Define $A(\epsilon) = A_0 + \epsilon B$, with


0 1 2 


As an approximate inverse to $A(\epsilon)$, use
3 
4 
A(e) = C=Azt =] -3 
L 
4 


We can solve the system $A(\epsilon)x = b$ using the residual correction method (8.5.3).
For the convergence analysis,

$$I - CA(\epsilon) = I - A_0^{-1}[A_0 + \epsilon B] = -\epsilon A_0^{-1} B$$


il 
I 
mn 
i) 
wim 
Ni Oo wi 


Convergence is assured if 
IZ — CA(€) Iho = le] <1 
and from (8.5.4), 
IJx ~ xO], < fel jx— x], m2 


There are many situations of the kind in this example. We may have to solve
linear systems of a general form $A(\epsilon)x = b$ for any $\epsilon$ near zero. To save time, we
obtain either $A(0)^{-1}$ or the LU decomposition of $A(0)$. This is then used as an
approximate inverse to $A(\epsilon)$, and we solve $A(\epsilon)x = b$ using the residual correction
method.


8.6 Iteration Methods 


As was mentioned in the introduction to this chapter, many linear systems are 
too large to be solved by direct methods based on Gaussian elimination. For 
these systems, iteration methods are often the only possible method of solution, 
as well as being faster than elimination in many cases. The largest area for the 
application of iteration methods is to the linear systems arising in the numerical
solution of partial differential equations. Systems of orders 10° to 10° are not 
unusual, although almost all of the coefficients of the system will be zero. As an 
example of such problems, the numerical solution of Poisson’s equation is studied 
in Section 8.8. The reader may want to combine reading that section with the
present one. 

Besides being large, the linear systems to be solved, Ax = b, often have 
several other important properties. They are usually sparse, which means that 
only a small percentage of the coefficients are nonzero. The nonzero coefficients 
generally have a special pattern in the way they occur in A, and there is usually a 
simple formula that can be used to generate the coefficients $a_{ij}$, as they are
needed, rather than having to store them. As one consequence of these properties, 
the storage space for the vectors x and b may be a more important consideration 
than is storage for A. The matrices A will often have special properties, which are 
discussed in this and the next two sections. 

We begin by defining and analyzing two classical iteration methods; following 
that, a general abstract framework is presented for studying iteration methods. 
The special properties of the linear system Ax = b are very important when 
setting up an iteration method for its solution. The results of this section are just 
a beginning to the design of a method for any particular area of applications. 


The Gauss-Jacobi method (Simultaneous displacements) Rewrite $Ax = b$ as

$$x_i = \frac{1}{a_{ii}} \Bigl[ b_i - \sum_{\substack{j=1 \\ j \ne i}}^{n} a_{ij} x_j \Bigr], \qquad i = 1, 2, \ldots, n \tag{8.6.1}$$

assuming all $a_{ii} \ne 0$. Define the iteration as

$$x_i^{(m+1)} = \frac{1}{a_{ii}} \Bigl[ b_i - \sum_{\substack{j=1 \\ j \ne i}}^{n} a_{ij} x_j^{(m)} \Bigr], \qquad i = 1, \ldots, n, \quad m \ge 0 \tag{8.6.2}$$

and assume initial guesses $x_i^{(0)}$, $i = 1, \ldots, n$, are given. There are other forms to
the method. For example, many problems are given naturally in the form

$$(I - B)x = b$$

and then we would usually first consider the iteration

$$x^{(m+1)} = b + Bx^{(m)}, \qquad m \ge 0 \tag{8.6.3}$$

Our initial error analysis is restricted to (8.6.2), but the same ideas can be used
for (8.6.3).

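To analyze the convergence, let $e^{(m)} = x - x^{(m)}$, $m \ge 0$. Subtracting (8.6.2)
from (8.6.1),

$$e_i^{(m+1)} = -\sum_{\substack{j=1 \\ j \ne i}}^{n} \frac{a_{ij}}{a_{ii}}\, e_j^{(m)}, \qquad i = 1, \ldots, n, \quad m \ge 0 \tag{8.6.4}$$

$$|e_i^{(m+1)}| \le \sum_{\substack{j=1 \\ j \ne i}}^{n} \left| \frac{a_{ij}}{a_{ii}} \right| \|e^{(m)}\|_\infty$$

Define

$$\mu = \max_{1 \le i \le n} \sum_{\substack{j=1 \\ j \ne i}}^{n} \left| \frac{a_{ij}}{a_{ii}} \right| \tag{8.6.5}$$

Then

$$|e_i^{(m+1)}| \le \mu\, \|e^{(m)}\|_\infty$$

and since the right side is independent of $i$,

$$\|e^{(m+1)}\|_\infty \le \mu\, \|e^{(m)}\|_\infty \tag{8.6.6}$$

If $\mu < 1$, then $e^{(m)} \to 0$ as $m \to \infty$ with a linear rate bounded by $\mu$, and

$$\|e^{(m)}\|_\infty \le \mu^m \|e^{(0)}\|_\infty \tag{8.6.7}$$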


In order for $\mu < 1$ to be true, the matrix $A$ must be diagonally dominant, that is,
it must satisfy

$$\sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}| < |a_{ii}|, \qquad i = 1, 2, \ldots, n \tag{8.6.8}$$

Such matrices occur in a number of applications, and often the associated matrix
is sparse.
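As a small illustration of (8.6.5) and (8.6.8), the following sketch computes $\mu$ for a given matrix; $\mu < 1$ is exactly the diagonal dominance condition. The function name `jacobi_mu` is hypothetical, and the matrix is the one used in the example (8.6.12) of this section.

```python
def jacobi_mu(A):
    """Return mu of (8.6.5): the largest row sum of |a_ij / a_ii| over j != i."""
    n = len(A)
    return max(
        sum(abs(A[i][j]) for j in range(n) if j != i) / abs(A[i][i])
        for i in range(n)
    )


# Matrix of the example (8.6.12).
A = [[10.0, 3.0, 1.0], [2.0, -10.0, 3.0], [1.0, 3.0, 10.0]]
mu = jacobi_mu(A)
is_diagonally_dominant = mu < 1.0
```

For this matrix the row sums are .4, .5, and .4, so $\mu = .5$ and the bound (8.6.6) applies.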
To have a more general result, write (8.6.4) as

$$e^{(m+1)} = Me^{(m)}, \qquad m \ge 0 \tag{8.6.9}$$

$$M = -\begin{bmatrix} 0 & \dfrac{a_{12}}{a_{11}} & \cdots & \dfrac{a_{1n}}{a_{11}} \\ \dfrac{a_{21}}{a_{22}} & 0 & \cdots & \dfrac{a_{2n}}{a_{22}} \\ \vdots & & \ddots & \vdots \\ \dfrac{a_{n1}}{a_{nn}} & \cdots & \dfrac{a_{n,n-1}}{a_{nn}} & 0 \end{bmatrix}$$

Inductively,

$$e^{(m)} = M^m e^{(0)} \tag{8.6.10}$$

If we want $e^{(m)} \to 0$ as $m \to \infty$, independent of the choice of $x^{(0)}$ (and thus of
$e^{(0)}$), it is necessary and sufficient that

$$M^m \to 0 \quad \text{as} \quad m \to \infty$$

Or equivalently from Theorem 7.9 of Chapter 7,

$$r_\sigma(M) < 1 \tag{8.6.11}$$

The condition $\mu < 1$ is merely the requirement that the row norm of $M$ be less
than 1, $\|M\|_\infty < 1$, and this implies (8.6.11). But now we see that $e^{(m)} \to 0$ if
$\|M\| < 1$ for any operator matrix norm.
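Example Consider solving $Ax = b$ by the Gauss-Jacobi method, with

$$A = \begin{bmatrix} 10 & 3 & 1 \\ 2 & -10 & 3 \\ 1 & 3 & 10 \end{bmatrix}, \qquad b = \begin{bmatrix} 14 \\ -5 \\ 14 \end{bmatrix}, \qquad x^{(0)} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \tag{8.6.12}$$

Solving for unknown $i$ in equation $i$, we have $x = g + Mx$,

$$M = \begin{bmatrix} 0 & -.3 & -.1 \\ .2 & 0 & .3 \\ -.1 & -.3 & 0 \end{bmatrix}, \qquad g = \begin{bmatrix} 1.4 \\ .5 \\ 1.4 \end{bmatrix}$$

The true solution is $x = [1, 1, 1]^T$. To check for convergence, note that
$\|M\|_\infty = .5$, $\|M\|_1 = .6$. Hence

$$\|e^{(m+1)}\|_\infty \le .5\, \|e^{(m)}\|_\infty, \qquad m \ge 0 \tag{8.6.13}$$

A similar statement holds for $\|e^{(m+1)}\|_1$. Thus convergence is guaranteed, and the
errors will decrease by at least one-half with each iteration. Actual numerical
results are given in Table 8.3, and they confirm the result (8.6.13). The final
column is

$$\text{Ratio} = \frac{\|e^{(m)}\|_\infty}{\|e^{(m-1)}\|_\infty} \tag{8.6.14}$$

It demonstrates that the convergence can vary from one step to the next, while
satisfying (8.6.13), or more generally (8.6.6).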
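The iterates of this example can be regenerated with a short script. This is a minimal sketch of (8.6.2) applied to the data (8.6.12); the function name `jacobi_step` and the list-based storage are illustrative choices, not part of the text.

```python
def jacobi_step(A, b, x):
    """Return the next Gauss-Jacobi iterate (8.6.2), built only from the old x."""
    n = len(A)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


A = [[10.0, 3.0, 1.0], [2.0, -10.0, 3.0], [1.0, 3.0, 10.0]]
b = [14.0, -5.0, 14.0]
x = [0.0, 0.0, 0.0]
errors = [1.0]                      # ||e^(0)||_inf, since the true solution is [1, 1, 1]
for m in range(1, 7):
    x = jacobi_step(A, b, x)
    errors.append(max(abs(1.0 - xi) for xi in x))

# Ratio column (8.6.14): successive quotients of the error norms.
ratios = [errors[m] / errors[m - 1] for m in range(1, 7)]
```

Every ratio produced this way is at most .5, in agreement with (8.6.13), although the individual ratios alternate between roughly .5 and smaller values.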


Table 8.3 Numerical results for the Gauss-Jacobi method

m   x_1^(m)     x_2^(m)     x_3^(m)     ||e^(m)||_inf   Ratio
0   0           0           0           1.0
1   1.4         .5          1.4         .5              .5
2   1.11        1.20        1.11        .2              .4
3   .929        1.055       .929        .071            .36
4   .9906       .9645       .9906       .0355           .50
5   1.01159     .9953       1.01159     .01159          .33
6   1.000251    1.005795    1.000251    .005795         .50
The Gauss-Seidel method (Successive displacements) Using (8.6.1), define

$$x_i^{(m+1)} = \frac{1}{a_{ii}} \Bigl[ b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(m+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(m)} \Bigr], \qquad i = 1, 2, \ldots, n \tag{8.6.15}$$

Each new component $x_i^{(m+1)}$ is immediately used in the computation of the next
component. This is convenient for computer calculations, since the new value can
be immediately stored in the location that held the old value, and this minimizes
the number of necessary storage locations. The storage requirement for $x$ with
the Gauss-Seidel method is only half what it would be with the Gauss-Jacobi
method.
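To analyze the error, subtract (8.6.15) from (8.6.1):

$$e_i^{(m+1)} = -\sum_{j=1}^{i-1} \frac{a_{ij}}{a_{ii}}\, e_j^{(m+1)} - \sum_{j=i+1}^{n} \frac{a_{ij}}{a_{ii}}\, e_j^{(m)}, \qquad i = 1, 2, \ldots, n \tag{8.6.16}$$

Define

$$\alpha_i = \sum_{j=1}^{i-1} \left| \frac{a_{ij}}{a_{ii}} \right|, \qquad \beta_i = \sum_{j=i+1}^{n} \left| \frac{a_{ij}}{a_{ii}} \right|, \qquad i = 1, \ldots, n$$

with $\alpha_1 = \beta_n = 0$. Using the same definition (8.6.5) for $\mu$ as with the Jacobi
method,

$$\mu = \max_{1 \le i \le n} (\alpha_i + \beta_i)$$

We assume $\mu < 1$. Then define

$$\eta = \max_{1 \le i \le n} \frac{\beta_i}{1 - \alpha_i} \tag{8.6.17}$$

From (8.6.16),

$$|e_i^{(m+1)}| \le \alpha_i \|e^{(m+1)}\|_\infty + \beta_i \|e^{(m)}\|_\infty, \qquad i = 1, \ldots, n \tag{8.6.18}$$

Let $k$ be a subscript for which

$$\|e^{(m+1)}\|_\infty = |e_k^{(m+1)}|$$

Then with $i = k$ in (8.6.18),

$$\|e^{(m+1)}\|_\infty \le \alpha_k \|e^{(m+1)}\|_\infty + \beta_k \|e^{(m)}\|_\infty$$

$$\|e^{(m+1)}\|_\infty \le \frac{\beta_k}{1 - \alpha_k}\, \|e^{(m)}\|_\infty$$

and thus

$$\|e^{(m+1)}\|_\infty \le \eta\, \|e^{(m)}\|_\infty \tag{8.6.19}$$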
Table 8.4 Numerical results for the Gauss-Seidel method

m   x_1^(m)     x_2^(m)     x_3^(m)     ||e^(m)||_inf   Ratio
0   0           0           0           1
1   1.4         .78         1.026       .4              .4
2   1.063400    1.020480    .987516     6.34E-2         .16
3   .995104     .995276     1.001907    4.90E-3         .077
4   1.001227    1.000817    .999632     1.23E-3         .25
5   .999792     .999848     1.000066    2.08E-4         .17
6   1.000039    1.000028    .999988     3.90E-5         .19


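Since for each $i$,

$$(\alpha_i + \beta_i) - \frac{\beta_i}{1 - \alpha_i} = \frac{\alpha_i [1 - (\alpha_i + \beta_i)]}{1 - \alpha_i} \ge \frac{\alpha_i}{1 - \alpha_i}\,[1 - \mu] \ge 0$$

we have

$$\eta \le \mu < 1 \tag{8.6.20}$$

Combined with (8.6.19), this shows the convergence $e^{(m)} \to 0$ as $m \to \infty$.
Also, the rate of convergence will be linear, with a bound $\eta$ that is no larger than
the bound $\mu$ for the Jacobi method.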


Example Use the system (8.6.12) of the previous example, and solve it with the
Gauss-Seidel method. By a simple calculation from (8.6.17) and (8.6.12),

$$\eta = .4$$

The numerical results are given in Table 8.4. The speed of convergence is
significantly better than for the previous example of the Gauss-Jacobi method,
given in Table 8.3. The values of Ratio appear to converge to about .18.
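The following sketch shows one way to code a Gauss-Seidel sweep (8.6.15), overwriting each component as soon as it is computed, together with the constant $\eta$ of (8.6.17). The function names are hypothetical; the data are those of (8.6.12).

```python
def gauss_seidel_sweep(A, b, x):
    """Perform one sweep of (8.6.15), overwriting x in place."""
    n = len(A)
    for i in range(n):
        # x[j] already holds the new value for j < i and the old value for j > i.
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x


def gauss_seidel_eta(A):
    """Return eta of (8.6.17), the maximum of beta_i / (1 - alpha_i)."""
    n = len(A)
    values = []
    for i in range(n):
        alpha = sum(abs(A[i][j] / A[i][i]) for j in range(i))
        beta = sum(abs(A[i][j] / A[i][i]) for j in range(i + 1, n))
        values.append(beta / (1.0 - alpha))
    return max(values)


A = [[10.0, 3.0, 1.0], [2.0, -10.0, 3.0], [1.0, 3.0, 10.0]]
b = [14.0, -5.0, 14.0]
first_iterate = gauss_seidel_sweep(A, b, [0.0, 0.0, 0.0])
eta = gauss_seidel_eta(A)
```

Only one copy of the vector is stored, which is the storage saving mentioned in the description of the method.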


General framework for iteration methods To solve $Ax = b$, form a splitting of $A$:

$$A = N - P \tag{8.6.21}$$

and write $Ax = b$ as

$$Nx = b + Px \tag{8.6.22}$$

The matrix $N$ is chosen in such a way that the linear system $Nz = f$ is "easily
solvable" for any $f$. For example, $N$ might be diagonal, triangular, or tridiagonal.
Define the iteration method by

$$Nx^{(m+1)} = b + Px^{(m)}, \qquad m \ge 0 \tag{8.6.23}$$

with $x^{(0)}$ given.
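Example 1. The Jacobi method.

$$N = \operatorname{diag}[a_{11}, a_{22}, \ldots, a_{nn}], \qquad P = N - A$$

2. Gauss-Seidel method.

$$N = \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \ddots & \vdots \\ \vdots & & \ddots & 0 \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \qquad P = N - A \tag{8.6.24}$$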

To analyze the error, subtract (8.6.23) from (8.6.22) to get

$$Ne^{(m+1)} = Pe^{(m)}$$

$$e^{(m+1)} = Me^{(m)}, \qquad M = N^{-1}P \tag{8.6.25}$$

By induction,

$$e^{(m)} = M^m e^{(0)}, \qquad m \ge 0 \tag{8.6.26}$$

In order that $e^{(m)} \to 0$ as $m \to \infty$, for arbitrary initial guesses $x^{(0)}$ (and thus
arbitrary $e^{(0)}$), it is necessary and sufficient that

$$M^m \to 0 \quad \text{as} \quad m \to \infty$$

Or equivalently from Theorem 7.9,

$$r_\sigma(M) < 1 \tag{8.6.27}$$


This general framework for iteration methods is adapted from Isaacson and 
Keller (1966, pp. 61-81). 
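The splitting framework (8.6.21)-(8.6.23) can be sketched generically: each step forms $b + Px^{(m)}$ and solves a system with $N$. In the sketch below, which is only an illustration, $N$ is taken to be the lower triangular part of $A$ as in (8.6.24), so that the solve is a forward substitution. The names `forward_solve` and `splitting_iteration` are hypothetical, and the data are again those of (8.6.12).

```python
def forward_solve(N, f):
    """Solve N z = f for a lower triangular N by forward substitution."""
    z = []
    for i, row in enumerate(N):
        s = sum(row[j] * z[j] for j in range(i))
        z.append((f[i] - s) / row[i])
    return z


def splitting_iteration(N, P, b, x0, num_iter, solve=forward_solve):
    """Iterate N x_new = b + P x, as in (8.6.23), num_iter times."""
    x = list(x0)
    for _ in range(num_iter):
        rhs = [bi + sum(p * xj for p, xj in zip(row, x)) for bi, row in zip(b, P)]
        x = solve(N, rhs)
    return x


A = [[10.0, 3.0, 1.0], [2.0, -10.0, 3.0], [1.0, 3.0, 10.0]]
b = [14.0, -5.0, 14.0]
n = len(A)
# Gauss-Seidel splitting (8.6.24): N is the lower triangle of A, and P = N - A.
N = [[A[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
P = [[N[i][j] - A[i][j] for j in range(n)] for i in range(n)]
x1 = splitting_iteration(N, P, b, [0.0, 0.0, 0.0], 1)
x30 = splitting_iteration(N, P, b, [0.0, 0.0, 0.0], 30)
```

Passing a different `solve` routine (for a diagonal or tridiagonal $N$, say) gives other methods of the same family without changing the loop.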

The condition (8.6.27) was derived earlier in (8.6.11) for the Gauss-Jacobi
method. For the Gauss-Seidel method the matrix $N^{-1}P$, with $N$ given by
(8.6.24), is more difficult to work with. We must examine the values of $\lambda$ for
which

$$\det(\lambda I - N^{-1}P) = 0$$

or equivalently,

$$\det(\lambda N - P) = 0 \tag{8.6.28}$$

For applications to the numerical solution of the partial differential equations
in Section 8.8, the preceding convergence analysis of the Gauss-Seidel method is
not adequate. The constants $\mu$ and $\eta$ of (8.6.5) and (8.6.17) will both equal 1,
although empirically the method still converges. To deal with many of these
systems, the following important theorem is often used.
Theorem 8.7 Let $A$ be Hermitian with positive diagonal elements. Then the
Gauss-Seidel method (8.6.15) for solving $Ax = b$ will converge,
for any choice of $x^{(0)}$, if and only if $A$ is positive definite.


Proof The proof is given in Isaacson and Keller (1966, pp. 70-71). For the
definition of a positive definite matrix, recall Problem 14 of Chapter 7.
The theorem is illustrated in Section 8.8.


Other iteration methods The best iteration methods are based on a thorough 
knowledge of the problem being solved, taking into account its special features in 
the design of the iteration scheme. This usually includes looking at the form of 
the matrix and the source of the linear system. 

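The matrix $A$ may have a special form that leads to a simple iteration method.
For example, suppose $A$ is of block tridiagonal form:

$$A = \begin{bmatrix} B_1 & C_1 & 0 & \cdots & 0 \\ A_2 & B_2 & C_2 & & \vdots \\ & \ddots & \ddots & \ddots & \\ \vdots & & A_{r-1} & B_{r-1} & C_{r-1} \\ 0 & \cdots & 0 & A_r & B_r \end{bmatrix} \tag{8.6.29}$$

The matrices $A_i$, $B_i$, $C_i$ are square, of order $m$, and $A$ is of order $n = rm$. For
$x, b \in \mathbf{R}^n$, write $x$ and $b$ in partitioned form as

$$x = \begin{bmatrix} x_{(1)} \\ \vdots \\ x_{(r)} \end{bmatrix}, \qquad b = \begin{bmatrix} b_{(1)} \\ \vdots \\ b_{(r)} \end{bmatrix}, \qquad x_{(i)}, b_{(i)} \in \mathbf{R}^m$$

Then $Ax = b$ can be rewritten as

$$B_1 x_{(1)} + C_1 x_{(2)} = b_{(1)}$$

$$A_i x_{(i-1)} + B_i x_{(i)} + C_i x_{(i+1)} = b_{(i)}, \qquad 2 \le i \le r - 1 \tag{8.6.30}$$

$$A_r x_{(r-1)} + B_r x_{(r)} = b_{(r)}$$

We assume the linear systems

$$B_j x_{(j)} = d_{(j)}, \qquad 1 \le j \le r \tag{8.6.31}$$

are easily solvable, probably directly, for all right sides $d_{(j)}$. For example, we
often have all $B_j = T$, a constant tridiagonal matrix to which the procedure in
(8.3.20)-(8.3.24) can be applied.
A Jacobi-type method can be applied to (8.6.30):

$$B_1 x_{(1)}^{(\nu+1)} = b_{(1)} - C_1 x_{(2)}^{(\nu)}$$

$$B_i x_{(i)}^{(\nu+1)} = b_{(i)} - A_i x_{(i-1)}^{(\nu)} - C_i x_{(i+1)}^{(\nu)}, \qquad 2 \le i \le r - 1 \tag{8.6.32}$$

$$B_r x_{(r)}^{(\nu+1)} = b_{(r)} - A_r x_{(r-1)}^{(\nu)}$$

for $\nu \ge 0$. The analysis of convergence is more complicated than for the
Gauss-Jacobi and Gauss-Seidel methods; some results are suggested in Problem
29. Similar methods are used with the linear systems arising from solving some
partial differential equations.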

Another important aspect of solving linear systems $Ax = b$ is to look at their
origin. In many cases we have a differential or integral equation, say

$$\mathcal{A}x = y \tag{8.6.33}$$

where $x$ and $y$ are functions. This is discretized to give a family of problems

$$A_n x_n = y_n, \qquad x_n, y_n \in \mathbf{R}^n \tag{8.6.34}$$

with $A_n$ of order $n$. As $n \to \infty$, the solutions $x_n$ of (8.6.34) approach (in some
sense) the solution $x$ of (8.6.33). Thus the linear systems in (8.6.34) are closely
related. For example, in some sense $A_m^{-1} \doteq A_n^{-1}$ for $m$ and $n$ sufficiently large,
even though they are matrices of different orders. This can be given a more
precise meaning, leading to ways of iteratively solving large systems by using the
solvability of lower order systems. Recently, many such methods have been
developed under the name of multigrid methods, with applications particularly to
partial differential equations [see Hackbusch and Trottenberg (1982)]. For iterative
methods for integral equations, see the related but different development in
Atkinson (1976, part II, chap. 4). Multigrid methods are very effective and
efficient iterative methods for differential and integral equations.


8.7 Error Prediction and Acceleration 


From (8.6.25), we have the error relation

$$x - x^{(m+1)} = M(x - x^{(m)}), \qquad m \ge 0 \tag{8.7.1}$$

The manner of convergence of $x^{(m)}$ to $x$ can be quite complicated, depending on
the eigenvalues and eigenvectors of $M$. But in most practical cases, the behavior
of the errors is quite simple: The size of $\|x - x^{(m)}\|_\infty$ decreases by approximately
a constant factor at each step, and

$$\|x - x^{(m+1)}\|_\infty \doteq c\, \|x - x^{(m)}\|_\infty \tag{8.7.2}$$

for some $c < 1$, closely related to $r_\sigma(M)$. To measure this constant $c$, note from
(8.7.1) that

$$x^{(m+1)} - x^{(m)} = e^{(m)} - e^{(m+1)} = Me^{(m-1)} - Me^{(m)}$$

$$x^{(m+1)} - x^{(m)} = M(x^{(m)} - x^{(m-1)}), \qquad m \ge 1 \tag{8.7.3}$$

This motivates the use of

$$c \doteq \frac{\|x^{(m+1)} - x^{(m)}\|_\infty}{\|x^{(m)} - x^{(m-1)}\|_\infty} \tag{8.7.4}$$

or for greater safety, the maximum of several successive such ratios. In many
applications, this ratio is about constant for large values of $m$.
Once this constant $c$ has been obtained, and assuming (8.7.2), we can bound
the error in $x^{(m+1)}$ by using (8.5.13):

$$\|x - x^{(m+1)}\|_\infty \le \frac{c}{1 - c}\, \|x^{(m+1)} - x^{(m)}\|_\infty \tag{8.7.5}$$

This bound is important when $c \doteq 1$ and the convergence is slow. In that case,
the difference $\|x^{(m+1)} - x^{(m)}\|_\infty$ can be much smaller than the actual error
$\|x - x^{(m+1)}\|_\infty$.

Table 8.5 Example of Gauss-Seidel iteration

m    ||u^(m) - u^(m-1)||_inf   Ratio   Est. Error   Error
20   1.20E-3                   .966    3.42E-2      3.09E-2
21   1.16E-3                   .966    3.24E-2      2.98E-2
22   1.12E-3                   .965    3.08E-2      2.86E-2
23   1.08E-3                   .965    2.93E-2      2.76E-2
24   1.04E-3                   .964    2.80E-2      2.65E-2
60   2.60E-4                   .962    6.58E-3      6.58E-3
61   2.50E-4                   .962    6.33E-3      6.33E-3
62   2.41E-4                   .962    6.09E-3      6.09E-3


Example The linear system (8.8.5) of Section 8.8 was solved using the
Gauss-Seidel method. In keeping with (8.8.5), we denote our unknown vector by
$u$. In (8.8.4), the function $f = x^2 y^2$, and in (8.8.5), the function $g = 2(x^2 + y^2)$.
The region was $0 \le x, y \le 1$, and the mesh size in each direction was $h = 1/16$.
This gave an order of 225 for the linear system (8.8.5). The initial guess $u^{(0)}$ in the
iteration was based on a "bilinear" interpolant of $f = x^2 y^2$ over the region
$0 \le x, y \le 1$ [see (8.8.17)]. A selection of numerical results is given in Table 8.5.
The column Ratio is calculated from (8.7.4), the column Est. Error uses (8.7.5),
and the column Error is the true error $\|u - u^{(m)}\|_\infty$.

As can be seen in the table, the convergence was quite slow, justifying the
need for (8.7.5) rather than the much smaller $\|u^{(m)} - u^{(m-1)}\|_\infty$. As $m \to \infty$, the
value of Ratio converges to .962, and the error estimate (8.7.5) is an accurate
estimator of the true iteration error.
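The arithmetic behind the Est. Error column can be checked directly from (8.7.5). The sketch below uses the rounded values printed in the row $m = 60$ of Table 8.5; the function name `estimated_error` is hypothetical.

```python
def estimated_error(difference_norm, c):
    """Return the bound c / (1 - c) * ||x^(m+1) - x^(m)|| of (8.7.5)."""
    return c / (1.0 - c) * difference_norm


# Row m = 60 of Table 8.5: difference 2.60E-4 and Ratio .962.
est = estimated_error(2.60e-4, 0.962)
amplification = 0.962 / (1.0 - 0.962)   # about 25
```

With $c = .962$ the factor $c/(1 - c)$ is about 25, so the error is roughly 25 times larger than the difference of successive iterates. This is why stopping on the difference alone would be misleading here.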


Speed of convergence We now discuss how many iterates to calculate in order
to obtain a desired error. And when is iteration preferable to Gaussian elimination
in solving $Ax = b$? We find the value of $m$ for which

$$\|x - x^{(m)}\|_\infty \le \epsilon\, \|x - x^{(0)}\|_\infty \tag{8.7.6}$$

with $\epsilon$ a given factor by which the initial error is to be reduced. We base the
analysis on the assumption (8.7.2). Generally the constant $c$ is almost equal to
$r_\sigma(M)$, with $M$ as in (8.7.1).
The relation (8.7.2) implies

$$\|x - x^{(m)}\|_\infty \le c^m \|x - x^{(0)}\|_\infty, \qquad m \ge 0 \tag{8.7.7}$$

Thus we find the smallest value of $m$ for which

$$c^m \le \epsilon$$

Solving this, we must have

$$m \ge \frac{-\ln \epsilon}{R(c)} \equiv m^*, \qquad R(c) = -\ln c \tag{8.7.8}$$

Doubling $R(c)$ leads to halving the number of iterates that must be calculated.

To make this result more meaningful, we apply it to the solution of a dense
linear system by iteration. Assume that the Gauss-Jacobi or Gauss-Seidel
method is being used to solve $Ax = b$ to single precision accuracy on an IBM
mainframe computer, that is, to about six significant digits. Assume $x^{(0)} = 0$, and
that we want to find $m$ such that

$$\frac{\|x - x^{(m)}\|_\infty}{\|x\|_\infty} \le 10^{-6} = \epsilon \tag{8.7.9}$$

Assuming $A$ has order $n$, the number of operations (multiplications and divisions)
per iteration is $n^2$. To obtain the result (8.7.9), the necessary number of
iterates is

$$m^* = \frac{6 \ln 10}{R(c)}$$

and the number of operations is

$$m^* n^2 = (6 \ln 10)\, \frac{n^2}{R(c)}$$

If Gaussian elimination is used to solve $Ax = b$ with the same accuracy, the
number of operations is about $n^3/3$. The iteration method will be more efficient
than the Gaussian elimination method if

$$m^* < \frac{n}{3} \tag{8.7.10}$$

Example Consider a matrix $A$ of order $n = 51$. Then iteration is more efficient
if $m^* < 17$. Table 8.6 gives the values of $m^*$ for various values of $c$. For $c \le .44$,
the iteration method will be more efficient than Gaussian elimination. And if less
than full precision accuracy in (8.7.9) is desired, then iteration will be more
efficient with even larger values of $c$. In practice, we also will usually know an
initial guess $x^{(0)}$ that is better than $x^{(0)} = 0$, further decreasing the number of
needed iterates.

Table 8.6 Example of iteration count

c     R(c)    m*
.9    .105    131
.8    .223    62
.6    .511    27
.4    .916    15
.2    1.61    9
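The entries of Table 8.6 follow from (8.7.8) with $\epsilon = 10^{-6}$. The short sketch below recomputes them; the function name `iterations_needed` is hypothetical, and rounding $m^*$ to the nearest integer is an assumption made here to match the tabulated values.

```python
import math


def iterations_needed(c, eps=1.0e-6):
    """Return m* = -ln(eps) / R(c) with R(c) = -ln(c), as in (8.7.8)."""
    return math.log(eps) / math.log(c)


# Columns of Table 8.6: c, R(c), and m* rounded to the nearest integer.
table = [
    (c, -math.log(c), round(iterations_needed(c)))
    for c in (0.9, 0.8, 0.6, 0.4, 0.2)
]

# Criterion (8.7.10) for n = 51: iteration wins when m* < 17.
n = 51
iteration_wins_at_044 = iterations_needed(0.44) < n / 3
```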


The main use of iteration methods is for the solution of large sparse systems, 
in which case Gaussian elimination is often not possible. And even when 
elimination is possible, iteration may still be preferable. Some examples of such 
systems are discussed in Section 8.8. 


Acceleration methods Most iteration methods have a regular pattern in which
the error decreases. This can often be used to accelerate the convergence, just as
was done in earlier chapters with other numerical methods. Rather than giving a
general theory for the acceleration of iteration methods for solving $Ax = b$, we
just describe an acceleration of the Gauss-Seidel method. This is one of the main
cases of interest in applications.

Recall the definition (8.6.15) of the Gauss-Seidel method. Introduce an
acceleration parameter $\omega$, and consider the following modification of (8.6.15):

$$z_i^{(m+1)} = \frac{1}{a_{ii}} \Bigl[ b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(m+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(m)} \Bigr]$$

$$x_i^{(m+1)} = \omega z_i^{(m+1)} + (1 - \omega) x_i^{(m)}, \qquad i = 1, \ldots, n \tag{8.7.11}$$

for $m \ge 0$. The case $\omega = 1$ is the regular Gauss-Seidel method. The acceleration
is to optimally choose some linear combination of the preceding iterate and the
regular Gauss-Seidel iterate. The method (8.7.11), with an optimal choice of $\omega$, is
called the SOR method, which is an abbreviation for successive overrelaxation, an
historical term.

To understand how $\omega$ should be chosen, we rewrite (8.7.11) in matrix form.
Decompose $A$ as

$$A = D + L + U$$

with $D = \operatorname{diag}[a_{11}, \ldots, a_{nn}]$, $L$ lower triangular, and $U$ upper triangular, with
both $L$ and $U$ having zeros on the diagonal. Then (8.7.11) becomes

$$z^{(m+1)} = D^{-1}[b - Lx^{(m+1)} - Ux^{(m)}]$$

$$x^{(m+1)} = \omega z^{(m+1)} + (1 - \omega)x^{(m)}, \qquad m \ge 0$$

Eliminating $z^{(m+1)}$ and solving for $x^{(m+1)}$,

$$[I + \omega D^{-1}L]\,x^{(m+1)} = \omega D^{-1}b + [(1 - \omega)I - \omega D^{-1}U]\,x^{(m)}$$

For the error,

$$e^{(m+1)} = M(\omega)e^{(m)}, \qquad m \ge 0 \tag{8.7.12}$$

$$M(\omega) = [I + \omega D^{-1}L]^{-1}[(1 - \omega)I - \omega D^{-1}U] \tag{8.7.13}$$

The parameter $\omega$ is to be chosen to minimize $r_\sigma(M(\omega))$, in order to make $x^{(m)}$
converge to $x$ as rapidly as possible. Call the optimal value $\omega^*$.

The calculation of $\omega^*$ is difficult except in the simplest of cases. And usually it
is obtained only approximately, based on trying several values of $\omega$ and observing
the effect on the speed of convergence. In spite of the problem of calculating
$\omega^*$, the resulting increase in the speed of convergence of $x^{(m)}$ to $x$ is very
dramatic, and the calculation of $\omega^*$ is well worth the effort. This is illustrated in
the next section.
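A componentwise sketch of (8.7.11) is given below. It is an illustration only: the function name `sor_sweep`, the small symmetric positive definite test matrix, and the value $\omega = 1.1$ are hypothetical choices and are not taken from the text.

```python
def sor_sweep(A, b, x, omega):
    """Perform one sweep of (8.7.11), overwriting x in place."""
    n = len(A)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        z = (b[i] - s) / A[i][i]                  # regular Gauss-Seidel value z_i
        x[i] = omega * z + (1.0 - omega) * x[i]   # relaxed combination
    return x


# Hypothetical symmetric positive definite system with solution [1, 1, 1].
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
x = [0.0, 0.0, 0.0]
for _ in range(50):
    sor_sweep(A, b, x, 1.1)

# With omega = 1 the sweep reduces to the Gauss-Seidel method (8.6.15).
gs_first = sor_sweep(A, b, [0.0, 0.0, 0.0], 1.0)
```

Trying several values of `omega` in such a loop and comparing the observed ratios (8.7.4) is the experimental approach to approximating $\omega^*$ that the text describes.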


Example We apply the acceleration (8.7.11) to the preceding example of the
Gauss-Seidel method, given following (8.7.5). The optimal acceleration parameter
is $\omega^* = 1.6735$. A more extensive discussion of the SOR method for solving
the linear systems arising in solving partial differential equations is given in the
following section. The initial guess was the same as before. The results are given
in Table 8.7.

The results show a much faster rate of convergence than for the Gauss-Seidel 
method. For example, with the Gauss-Seidel method, we have |ju — u@?®\| | = 
9.70E — 6. In comparison, the SOR method leads to |[u — u@||,, = 8.71E — 6. 
But we have lost the regular behavior in the convergence of the iterates, as can be 
seen from the values of Ratio. The value of c used in the error test (8.7.5) needs 
to be chosen more carefully than our choice of c = Ratio in the table. You may 
want to use an average or the maximum of several successive preceding values of 


Ratio. 


Table 8.7 Example of SOR method (8.7.11)

m    ||u^(m) - u^(m-1)||_inf   Ratio   Est. Error   Error
21   2.06E-4                   .693    4.65E-4      3.64E-4
22   1.35E-4                   .657    2.59E-4      2.65E-4
23   8.76E-5                   .648    1.61E-4      1.87E-4
24   5.11E-5                   .584    7.17E-5      1.39E-4
25   3.48E-5                   .680    7.40E-5      1.06E-4
26   2.78E-5                   .800    1.11E-4      8.04E-5
27   2.46E-5                   .884    1.87E-4      6.15E-5
28   2.07E-5                   .842    1.11E-4      4.16E-5
8.8 The Numerical Solution of Poisson's Equation


The most important application of linear iteration methods is to the large linear
systems arising from the numerical solution of partial differential equations by
finite difference methods. To illustrate this, we solve the Dirichlet problem for
Poisson's equation on the unit square in the xy-plane:

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = g(x, y), \qquad 0 < x, y < 1 \tag{8.8.1}$$

$$u(x, y) = f(x, y), \qquad (x, y) \text{ a boundary point}$$

The functions $g(x, y)$ and $f(x, y)$ are given, and we must find $u(x, y)$.
For $N > 1$, define $h = 1/N$, and

$$(x_j, y_k) = (jh, kh), \qquad 0 \le j, k \le N$$

These are called the grid points or mesh points (see Figure 8.1). To approximate
(8.8.1), we use approximations to the second derivatives. For a four times
continuously differentiable function $G(x)$ on $[x - h, x + h]$, the results
(5.7.17)-(5.7.18) of Section 5.7 give

$$G''(x) = \frac{G(x + h) - 2G(x) + G(x - h)}{h^2} - \frac{h^2}{12}\, G^{(4)}(\xi), \qquad x - h \le \xi \le x + h \tag{8.8.2}$$

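When applied to (8.8.1) at each interior grid point, we obtain

$$\frac{u(x_{j+1}, y_k) - 2u(x_j, y_k) + u(x_{j-1}, y_k)}{h^2} + \frac{u(x_j, y_{k+1}) - 2u(x_j, y_k) + u(x_j, y_{k-1})}{h^2}$$

$$= g(x_j, y_k) + \frac{h^2}{12} \left[ \frac{\partial^4 u(\xi_j, y_k)}{\partial x^4} + \frac{\partial^4 u(x_j, \eta_k)}{\partial y^4} \right] \tag{8.8.3}$$

for some $x_{j-1} \le \xi_j \le x_{j+1}$, $y_{k-1} \le \eta_k \le y_{k+1}$, $1 \le j, k \le N - 1$.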
For the numerical approximation $u_h(x, y)$ of (8.8.1), let

$$u_h(x_j, y_k) = f(x_j, y_k), \qquad (x_j, y_k) \text{ a boundary grid point} \tag{8.8.4}$$

At all interior mesh points, drop the right-hand truncation errors in (8.8.3) and
solve for the approximating solution $u_h(x_j, y_k)$:

$$u_h(x_j, y_k) = \frac{1}{4} \bigl[ u_h(x_{j+1}, y_k) + u_h(x_j, y_{k+1}) + u_h(x_{j-1}, y_k) + u_h(x_j, y_{k-1}) \bigr] - \frac{h^2}{4}\, g(x_j, y_k), \qquad 1 \le j, k \le N - 1 \tag{8.8.5}$$

Figure 8.1 Finite difference mesh.

The number of equations in (8.8.4)-(8.8.5) is equal to the number of unknowns,
$(N + 1)^2$.

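Theorem 8.8 For each $N \ge 2$, the linear system (8.8.4)-(8.8.5) has a unique
solution $\{u_h(x_j, y_k) \mid 0 \le j, k \le N\}$. If the solution $u(x, y)$ of
(8.8.1) is four times continuously differentiable, then

$$\max_{0 \le j, k \le N} |u(x_j, y_k) - u_h(x_j, y_k)| \le ch^2 \tag{8.8.6}$$

$$c = \frac{1}{24} \left[ \max_{0 \le x, y \le 1} \left| \frac{\partial^4 u(x, y)}{\partial x^4} \right| + \max_{0 \le x, y \le 1} \left| \frac{\partial^4 u(x, y)}{\partial y^4} \right| \right]$$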

Proof 1. We prove the unique solvability of (8.8.4)-(8.8.5) by using Theorem
7.2 of Chapter 7. We consider the homogeneous system

$$v_h(x_j, y_k) = \frac{1}{4} \bigl[ v_h(x_{j+1}, y_k) + v_h(x_j, y_{k+1}) + v_h(x_{j-1}, y_k) + v_h(x_j, y_{k-1}) \bigr], \qquad 1 \le j, k \le N - 1 \tag{8.8.7}$$

$$v_h(x_j, y_k) = 0, \qquad (x_j, y_k) \text{ a boundary point} \tag{8.8.8}$$

By showing that this system has only the trivial solution $v_h(x_j, y_k) \equiv 0$,
it will follow from Theorem 7.2 that the nonhomogeneous system
(8.8.4)-(8.8.5) will have a unique solution.
Let

$$\alpha = \max_{0 \le j, k \le N} v_h(x_j, y_k)$$

From (8.8.8), $\alpha \ge 0$. Assume $\alpha > 0$. Then there must be an interior grid
point $(x_j, y_k)$ for which this maximum is attained. But using (8.8.7),
$v_h(x_j, y_k)$ is the average of the values of $v_h$ at the four points neighboring
$(x_j, y_k)$. The only way that this can be compatible with $(x_j, y_k)$
being a maximum point is if $v_h$ also equals $\alpha$ at the four neighboring grid
points. Continue the same argument to these neighboring points. Since
there are only a finite number of grid points, we eventually have
$v_h(x_j, y_k) = \alpha$ for a boundary point $(x_j, y_k)$. But then $\alpha > 0$ will
contradict (8.8.8). Thus the maximum of $v_h(x_j, y_k)$ is zero. A similar
argument will show that the minimum of $v_h(x_j, y_k)$ is also zero. Taken
together, these results show that the only solution of (8.8.7)-(8.8.8) is
$v_h(x_j, y_k) \equiv 0$.
2. To consider the convergence of $u_h(x_j, y_k)$ to $u(x_j, y_k)$, define

$$e_h(x_j, y_k) = u(x_j, y_k) - u_h(x_j, y_k)$$

Subtracting (8.8.5) from (8.8.3), we obtain

$$e_h(x_j, y_k) = \frac{1}{4} \bigl[ e_h(x_{j+1}, y_k) + e_h(x_j, y_{k+1}) + e_h(x_{j-1}, y_k) + e_h(x_j, y_{k-1}) \bigr] - \frac{h^4}{48} \left[ \frac{\partial^4 u(\xi_j, y_k)}{\partial x^4} + \frac{\partial^4 u(x_j, \eta_k)}{\partial y^4} \right] \tag{8.8.9}$$

and from (8.8.4),

$$e_h(x_j, y_k) = 0, \qquad (x_j, y_k) \text{ a boundary grid point} \tag{8.8.10}$$

This system can be treated in a manner similar to that used in part (1),
and the result (8.8.6) will follow. Because it is not central to the
discussion of the linear systems, the argument is omitted [see Isaacson
and Keller (1966), pp. 447-450].


Example Solve

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0, \qquad 0 \le x, y \le 1 \tag{8.8.11}$$

$$u(0, y) = \cos(\pi y), \qquad u(1, y) = e^{\pi} \cos(\pi y)$$

$$u(x, 0) = e^{\pi x}, \qquad u(x, 1) = -e^{\pi x}$$

The true solution is

$$u(x, y) = e^{\pi x} \cos(\pi y)$$

Numerical results for several values of $N$ are given in Table 8.8. The error is the
maximum over all grid points, and the column Ratio gives the factor by which
the maximum error decreases when the grid size $h$ is halved. Theoretically from
(8.8.6), it should be 4.0. The numerical results confirm this.
Table 8.8 Numerical solution of (8.8.11)

N    ||u - u_h||_inf   Ratio
4    .144
8    .0390             3.7
16   .0102             3.8
32   .00260            3.9
64   .000654           4.0


Iterative Solution Because the Gauss-Seidel method is generally faster than the
Gauss-Jacobi method, we only consider the former. For $k = 1, 2, \ldots, N - 1$,
define

$$u_h^{(m+1)}(x_j, y_k) = \frac{1}{4} \bigl[ u_h^{(m)}(x_{j+1}, y_k) + u_h^{(m)}(x_j, y_{k+1}) + u_h^{(m+1)}(x_{j-1}, y_k) + u_h^{(m+1)}(x_j, y_{k-1}) \bigr] - \frac{h^2}{4}\, g(x_j, y_k), \qquad j = 1, 2, \ldots, N - 1 \tag{8.8.12}$$

For boundary points, use

$$u_h^{(m)}(x_j, y_k) = f(x_j, y_k), \qquad \text{all } m \ge 0$$

The values of $u_h^{(m+1)}(x_j, y_k)$ are computed row by row, from the bottom row of
grid points first to the top row of points last. And within each row, we solve from
left to right.
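The sweep (8.8.12) translates almost directly into nested loops. The sketch below is illustrative only: the function name `poisson_gauss_seidel` is hypothetical, the interior initial guess is simply zero rather than a bilinear interpolant, and the test problem uses $f = x^2 y^2$, $g = 2(x^2 + y^2)$ from the example of Section 8.7, for which $u = x^2 y^2$ solves (8.8.1).

```python
def poisson_gauss_seidel(N, f, g, sweeps):
    """Apply the Gauss-Seidel sweep (8.8.12) on an (N+1) x (N+1) grid."""
    h = 1.0 / N
    # u[j][k] approximates u(x_j, y_k); interior values start at zero.
    u = [[0.0] * (N + 1) for _ in range(N + 1)]
    for j in range(N + 1):
        for k in range(N + 1):
            if j in (0, N) or k in (0, N):
                u[j][k] = f(j * h, k * h)        # boundary values (8.8.4)
    for _ in range(sweeps):
        for k in range(1, N):                    # rows, bottom to top
            for j in range(1, N):                # within a row, left to right
                neighbors = u[j + 1][k] + u[j][k + 1] + u[j - 1][k] + u[j][k - 1]
                u[j][k] = 0.25 * neighbors - 0.25 * h * h * g(j * h, k * h)
    return u


def f(x, y):
    return x * x * y * y


def g(x, y):
    return 2.0 * (x * x + y * y)


N = 4
u = poisson_gauss_seidel(N, f, g, 100)
max_error = max(
    abs(u[j][k] - f(j / N, k / N)) for j in range(N + 1) for k in range(N + 1)
)
```

Because the fourth derivatives of $x^2 y^2$ with respect to $x$ and to $y$ vanish, the truncation terms in (8.8.3) are zero for this test problem, so the converged iterates agree with the true solution at the grid points up to rounding error.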

For the iteration (8.8.12), the convergence analysis must be based on Theorem
8.7. It can be shown easily that the matrix is symmetric, and thus all eigenvalues
are real. Moreover, the eigenvalues can all be shown to lie in the interval
$0 < \lambda < 2$. From this and Problem 14 of Chapter 7, the matrix is positive
definite. Since all diagonal coefficients of the matrix are positive, it then follows
from Theorem 8.7 that the Gauss-Seidel method will converge. To show that all
eigenvalues lie in $0 < \lambda < 2$, see Isaacson and Keller (1966, pp. 458-459) or
Problem 3 of Chapter 9.

The calculation of the speed of convergence $r_\sigma(M)$ from (8.6.28) is nontrivial.
The argument is quite sophisticated, and we only refer to the very complete
development in Isaacson and Keller (1966, pp. 463-470), including the material
on the acceleration of the Gauss-Seidel method. It can be shown that

$$r_\sigma(M) = 1 - \pi^2 h^2 + O(h^4) \tag{8.8.13}$$

The Gauss-Seidel method converges, but the speed of convergence is quite slow
for even moderately small values of $h$. This is illustrated in Table 8.5 of
Section 8.7.
To accelerate the Gauss-Seidel method, use

$$v_h^{(m+1)}(x_j, y_k) = \frac{1}{4} \bigl[ u_h^{(m)}(x_{j+1}, y_k) + u_h^{(m)}(x_j, y_{k+1}) + u_h^{(m+1)}(x_{j-1}, y_k) + u_h^{(m+1)}(x_j, y_{k-1}) \bigr] - \frac{h^2}{4}\, g(x_j, y_k)$$

$$u_h^{(m+1)}(x_j, y_k) = \omega v_h^{(m+1)}(x_j, y_k) + (1 - \omega)\, u_h^{(m)}(x_j, y_k), \qquad j = 1, \ldots, N - 1 \tag{8.8.14}$$

for $k = 1, \ldots, N - 1$. The optimal acceleration parameter is

$$\omega^* = \frac{2}{1 + \sqrt{1 - \xi^2}}, \qquad \xi = 1 - 2 \sin^2\!\left( \frac{\pi}{2N} \right) \tag{8.8.15}$$

The corresponding rate of convergence is

$$r_\sigma(M(\omega^*)) = \omega^* - 1 = 1 - 2\pi h + O(h^2) \tag{8.8.16}$$

This is a much better rate than that given by (8.8.13). The accelerated
Gauss-Seidel method (8.8.14) with the optimal value $\omega^*$ of (8.8.15) is known as
the SOR method. The name SOR is an abbreviation for successive overrelaxation,
a name that is based on a physical interpretation of the method, first used in
deriving it.
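The two rates (8.8.13) and (8.8.16) can be compared numerically. The sketch below assumes the standard form $\omega^* = 2/(1 + \sqrt{1 - \xi^2})$ with $\xi = 1 - 2\sin^2(\pi/2N)$ for (8.8.15); the function names are hypothetical. The leading-order Gauss-Seidel rate is taken as $\xi^2$, which expands to $1 - \pi^2 h^2 + O(h^4)$ and is likewise an assumption of this sketch.

```python
import math


def optimal_omega(N):
    """Return omega* from the assumed form of (8.8.15) for mesh size h = 1/N."""
    xi = 1.0 - 2.0 * math.sin(math.pi / (2 * N)) ** 2
    return 2.0 / (1.0 + math.sqrt(1.0 - xi * xi))


def gauss_seidel_rate(N):
    """Return xi**2, which agrees with (8.8.13) to the order shown there."""
    xi = 1.0 - 2.0 * math.sin(math.pi / (2 * N)) ** 2
    return xi * xi


N = 16
omega_star = optimal_omega(N)        # compare with 1.6735 in Section 8.7
sor_rate = omega_star - 1.0          # rate (8.8.16)
gs_rate = gauss_seidel_rate(N)       # compare with Ratio .962 in Table 8.5
```

For $N = 16$ these values are consistent with the parameter $\omega^* = 1.6735$ and the limiting Ratio of .962 reported in Section 8.7.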


Example Recall the previous example (8.8.11). This was solved with both the
Gauss-Seidel method and the SOR method. The initial guess for the iteration
was taken to be the "bilinear" interpolation formula for the boundary data $f$:

$$u_h^{(0)}(x, y) = (1 - x) f(0, y) + x f(1, y) + (1 - y) f(x, 0) + y f(x, 1)$$
$$- \bigl[ (1 - y)(1 - x) f(0, 0) + (1 - y) x f(1, 0) + y (1 - x) f(0, 1) + x y f(1, 1) \bigr] \tag{8.8.17}$$

at all interior grid points. The error test to stop the iteration was

$$\max_{0 \le j, k \le N} |u_h(x_j, y_k) - u_h^{(m)}(x_j, y_k)| \le \epsilon$$

with $\epsilon > 0$ given and the right-hand side of (8.7.5) used to predict the error in the
iterate. The numerical results for the necessary number of iterates are given in
Table 8.9. The SOR method requires far fewer iterates for the smaller values of $h$
than does the Gauss-Seidel method.
Table 8.9 Number of iterates necessary to solve (8.8.5)

N    epsilon   Gauss-Seidel   SOR
8    .01       25             12
8    .001      40             16
16   .001      142            32
32   .001      495            65
8    .0001     54             18
16   .0001     201            35
32   .0001     733            71


Recall from the previous section that the number of iterates, called $m^*$,
necessary to reduce the iteration error by a factor of $\epsilon$ is proportional to $-1/\ln(c)$,
where $c$ is the ratio by which the iteration error decreases at each step. For the
methods of this section, we take $c = r_\sigma(M)$. If we write $r_\sigma(M) = 1 - \delta$, then

$$\frac{1}{-\ln r_\sigma(M)} = \frac{1}{-\ln(1 - \delta)} \doteq \frac{1}{\delta}$$

When $\delta$ is halved, the number of iterates to be computed is doubled. For the
Gauss-Seidel method,

$$\frac{1}{-\ln r_\sigma(M)} \doteq \frac{1}{\pi^2 h^2}$$

When $h$ is halved, the number of iterates to be computed increases by a factor of
4. For the SOR method,

$$\frac{1}{-\ln r_\sigma(M(\omega^*))} \doteq \frac{1}{2\pi h}$$

and when $h$ is halved, the number of iterates is doubled. These two results are
illustrated in Table 8.9 by the entries for $\epsilon = 10^{-3}$ and $\epsilon = 10^{-4}$.

With either method, note that doubling N will increase the number of 
equations to be solved by a factor of 4, and thus the work per iteration will 
increase by the same amount. The use of SOR greatly reduces the resulting work, 
although it still is large when N is large. 


8.9 The Conjugate Gradient Method 


The iteration method presented in this section was developed in the 1950s, but
has gained its main popularity in more recent years, especially in solving linear
systems associated with the numerical solution of linear partial differential
equations. The literature on the conjugate gradient method (CG method) has
become quite large and sophisticated, and there are numerous connections to
other topics in linear algebra. Thus, for reasons of space, we are able to give only
a brief introduction, defining the CG method and stating some of the principal
theoretical results concerning it.

The CG method differs from earlier methods of this chapter, in that it is based 
on solving a nonlinear problem; in fact, the CG method is also a commonly used 
method for minimizing nonlinear functions. The linear system to be solved, 


Ax =b (8.9.1) 


is assumed to have a coefficient matrix A that is real, symmetric, and positive 
definite. The solution of this system is equivalent to the minimization of the 
function 


$$f(x) = \tfrac{1}{2} x^T A x - b^T x \qquad x \in R^n \tag{8.9.2}$$


The unique solution x* of Ax = b is also the unique minimizer of f(x) as x
varies over $R^n$. To see this, first show

$$f(x) = E(x) - \tfrac{1}{2} b^T x^* \qquad E(x) = \tfrac{1}{2}(x^* - x)^T A (x^* - x) \tag{8.9.3}$$

Using Ax* = b, the proof is straightforward. The functions E(x) and f(x) differ
by a constant, and thus they will have the same minimizers. By the positive
definiteness of A, E(x) is minimized uniquely by x = x*, and thus the same is
true of f(x).
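
For reference, one way to carry out the verification of (8.9.3) is sketched below. It uses only Ax* = b and the symmetry of A; the intermediate steps are a reconstruction, not taken from the text.

```latex
\begin{aligned}
E(x) &= \tfrac{1}{2}(x^* - x)^T A (x^* - x) \\
     &= \tfrac{1}{2} x^{*T} A x^* - x^T A x^* + \tfrac{1}{2} x^T A x
        && \text{(symmetry of } A\text{)} \\
     &= \tfrac{1}{2} b^T x^* - b^T x + \tfrac{1}{2} x^T A x
        && \text{(using } A x^* = b\text{)} \\
     &= f(x) + \tfrac{1}{2} b^T x^*
\end{aligned}
```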

A well-known iteration method for finding a minimum for a nonlinear
function is the method of steepest descent, which was introduced briefly in Section
2.12. For minimizing f(x) by this method, assume that an initial guess $x_0$ is
given. Choose a path in which to search for a new minimum by looking along the
direction in which f(x) decreases most rapidly at $x_0$. This is given by
$g_0 = -\nabla f(x_0)$, the negative of the gradient of f(x) at $x_0$:

$$g(x_0) \equiv g_0 = b - A x_0 \tag{8.9.4}$$

Then solve the one-dimensional minimization problem

$$\min_{0 \le \alpha < \infty} f(x_0 + \alpha g_0)$$

calling the solution $\alpha_1$. Using it, define the new iterate

$$x_1 = x_0 + \alpha_1 g_0 \tag{8.9.5}$$

Continue this process inductively. The method of steepest descent will converge,
but the convergence is generally quite slow. The locally optimal strategy of using a
direction of fastest descent is not a good strategy for reaching the global
minimum. In comparison, the CG method will be more rapid, and it will take no
more than n iterates, assuming there are no rounding errors.
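
The steepest descent iteration (8.9.4)-(8.9.5) can be sketched in a few lines. This is an illustration under stated assumptions, not a production routine: the helper names, the stopping tolerance `tol`, and the 2 × 2 test system are hypothetical. For the quadratic f(x) of (8.9.2), the one-dimensional minimization has the closed form $\alpha = g^T g / g^T A g$, which the code uses.

```python
def matvec(A, x):
    """Return the product A x, with A stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10000):
    """Minimize f(x) = (1/2) x^T A x - b^T x by steepest descent.

    Returns the final iterate and the number of steps taken.
    """
    x = list(x0)
    for step in range(max_iter):
        # g = -grad f(x) = b - A x, as in (8.9.4)
        g = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        gg = dot(g, g)
        if gg ** 0.5 <= tol:
            return x, step
        # Exact line search along g for the quadratic f
        alpha = gg / dot(g, matvec(A, g))
        x = [xi + alpha * gi for xi, gi in zip(x, g)]
    return x, max_iter

# Hypothetical symmetric positive definite test system; solution is [1/11, 7/11]
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, steps = steepest_descent(A, b, [0.0, 0.0])
print(x, steps)
```

Even on this well-conditioned system, the number of steps exceeds the order n = 2, whereas a conjugate direction method terminates in at most n steps in exact arithmetic.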
For the remainder of this section, we assume the given initial guess is $x_0 = 0$.
If it is not, then we can always solve the modified problem

$$Az = b - A x_0$$

Denoting its solution by z*, we have $x^* = x_0 + z^*$. An initial guess of
$z^* \doteq z_0 = 0$ corresponds to $x^* \doteq x_0$ in the original problem.
Henceforth, assume $x_0 = 0$.


Conjugate direction methods   Assuming A is n × n, we say a set of nonzero
vectors $p_1, \dots, p_n$ in $R^n$ is A-conjugate if

$$p_i^T A p_j = 0 \qquad 1 \le i, j \le n \quad i \ne j \tag{8.9.6}$$

The vectors $p_j$ are often called conjugate directions. An equivalent geometric
definition can be given by introducing a new inner product and norm for $R^n$:

$$(x, y)_A = x^T A y \qquad \|x\|_A = \sqrt{(x, x)_A} = \sqrt{x^T A x} \qquad x, y \in R^n \tag{8.9.7}$$

The condition (8.9.6) is equivalent to requiring $p_1, \dots, p_n$ to be an orthogonal
basis for $R^n$ with respect to the inner product $(\cdot,\cdot)_A$. Thus we also say that
$\{p_1, \dots, p_n\}$ are A-orthogonal if they satisfy (8.9.6). With the norm $\|\cdot\|_A$, the
function E(x) of (8.9.3) is seen to be

$$E(x) = \tfrac{1}{2}\|x^* - x\|_A^2 \tag{8.9.8}$$

which is more clearly a measure of the error x − x*. The relationship of $\|x\|_2$
and $\|x\|_A$ is further explored in Problem 36.

Given a set of conjugate directions $\{p_1, \dots, p_n\}$, it is straightforward to solve
Ax = b. Let

$$x^* = \alpha_1 p_1 + \cdots + \alpha_n p_n$$

Using (8.9.6),

$$\alpha_k = \frac{p_k^T A x^*}{p_k^T A p_k} = \frac{p_k^T b}{p_k^T A p_k} \qquad k = 1, \dots, n \tag{8.9.9}$$

We use this formula for x* to introduce the conjugate direction iteration method.
Let $x_0 = 0$,

$$x_k = \alpha_1 p_1 + \cdots + \alpha_k p_k \qquad 1 \le k \le n \tag{8.9.10}$$

Introduce

$$r_k = b - A x_k = -\nabla f(x_k)$$

the residual of $x_k$ in Ax = b. Easily, $r_0 = b$ and

$$x_k = x_{k-1} + \alpha_k p_k \qquad r_k = r_{k-1} - \alpha_k A p_k \qquad k = 1, \dots, n \tag{8.9.11}$$

For k = n, we have $x_n = x^*$ and $r_n = 0$, and $x_k$ may equal x* with a smaller
value of k.
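
The conjugate direction iteration (8.9.9)-(8.9.11) can be sketched as follows. The text does not say here how the directions are obtained; as an illustrative assumption, the sketch builds them by A-orthogonalizing the unit vectors $e_1, \dots, e_n$ with a Gram-Schmidt process in the inner product $(\cdot,\cdot)_A$. All function names and the test system are hypothetical.

```python
def matvec(A, x):
    """Return the product A x, with A stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def a_conjugate_basis(A):
    """A-orthogonalize e_1, ..., e_n (one illustrative choice of directions)."""
    n = len(A)
    P = []
    for i in range(n):
        e = [1.0 if j == i else 0.0 for j in range(n)]
        Ae = matvec(A, e)
        p = e[:]
        for q in P:
            # Remove the component of e along q in the A-inner product
            coef = dot(q, Ae) / dot(q, matvec(A, q))
            p = [pj - coef * qj for pj, qj in zip(p, q)]
        P.append(p)
    return P

def conjugate_direction_solve(A, b, P):
    """Apply (8.9.9) and (8.9.11) with x_0 = 0 and r_0 = b."""
    x = [0.0] * len(b)
    r = list(b)
    for p in P:
        Ap = matvec(A, p)
        alpha = dot(p, b) / dot(p, Ap)                    # (8.9.9)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]     # (8.9.11)
        r = [ri - alpha * api for ri, api in zip(r, Ap)]  # (8.9.11)
    return x, r

# Hypothetical symmetric positive definite test system; solution is [1/11, 7/11]
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
P = a_conjugate_basis(A)
x, r = conjugate_direction_solve(A, b, P)
print(x, r)
```

After n steps the residual vanishes up to rounding, as stated for k = n.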


Lemma 1   The term $r_k$ is orthogonal to $p_1, \dots, p_k$, that is, $r_k^T p_i = 0$, $1 \le i \le k$.
We leave the proof to Problem 37.


Lemma 2   (a) The minimization problem

$$\min_{-\infty < \alpha < \infty} f(x_{k-1} + \alpha p_k)$$

is solved uniquely by $\alpha = \alpha_k$, yielding $f(x_k)$ as the minimum.

(b) Let $\mathcal{S}_k = \mathrm{Span}\{p_1, \dots, p_k\}$, the k-dimensional subspace
generated by $\{p_1, \dots, p_k\}$. Then the problem

$$\min_{x \in \mathcal{S}_k} f(x)$$

is uniquely solved by $x = x_k$, yielding the minimum $f(x_k)$.

Proof   (a) Expand $\varphi(\alpha) \equiv f(x_{k-1} + \alpha p_k)$:

$$\varphi(\alpha) = f(x_{k-1}) + \alpha p_k^T A x_{k-1} + \tfrac{1}{2}\alpha^2 p_k^T A p_k - \alpha b^T p_k$$

The term $p_k^T A x_{k-1}$ equals 0, because $x_{k-1} \in \mathcal{S}_{k-1}$ and $p_k$ is
A-orthogonal to $\mathcal{S}_{k-1}$. Solve $\varphi'(\alpha) = 0$, obtaining $\alpha = \alpha_k$, to complete the
proof.

(b) Expand $f(x_k + h)$, for any $h \in \mathcal{S}_k$:

$$f(x_k + h) = f(x_k) + h^T A x_k + \tfrac{1}{2} h^T A h - h^T b = f(x_k) + \tfrac{1}{2} h^T A h - h^T r_k$$

By Lemma 1 and the assumption $h \in \mathcal{S}_k$, it follows that $h^T r_k = 0$. Thus

$$f(x_k + h) = f(x_k) + \tfrac{1}{2} h^T A h \ge f(x_k)$$

since A is positive definite. The minimum is attained uniquely in $\mathcal{S}_k$ by
letting h = 0, proving (b).


Lemma 2 gives an optimality property for conjugate direction methods,
defined by (8.9.11) and (8.9.9). The problem is knowing how to choose the
conjugate directions $\{p_k\}$. There are many possible choices, and some of them
lead to well-known methods for directly solving Ax = b [see Stewart (1973b)].


The conjugate gradient method   We give a way to simultaneously generate the
directions $\{p_k\}$ and the iterates $\{x_k\}$. For the first direction $p_1$, we use the
steepest descent direction:

$$p_1 = -\nabla f(x_0) = r_0 = b \tag{8.9.12}$$

since $x_0 = 0$. An inductive construction is given for the remaining directions.
Assume $x_1, \dots, x_k$ have been generated, along with the conjugate directions
$p_1, \dots, p_k$. A new direction $p_{k+1}$ must be chosen, one that is A-conjugate to
$p_1, \dots, p_k$. Also, assume $x_k \ne x^*$, and thus $r_k \ne 0$; otherwise, we would have the
solution x* and there would be no point to proceeding.

By Lemma 1, $r_k$ is orthogonal to $\mathcal{S}_k$, and thus $r_k$ does not belong to $\mathcal{S}_k$. We
use $r_k$ to generate $p_{k+1}$, choosing a component of $r_k$. For reasons too
complicated to consider here, it suffices to consider

$$p_{k+1} = r_k + \beta_{k+1} p_k \tag{8.9.13}$$

Then the condition $p_k^T A p_{k+1} = 0$ implies

$$\beta_{k+1} = -\frac{p_k^T A r_k}{p_k^T A p_k} \tag{8.9.14}$$

The denominator is nonzero since A is positive definite and $p_k \ne 0$. It can be
shown [Luenberger (1984), p. 245] that this definition of $p_{k+1}$ also satisfies

$$p_j^T A p_{k+1} = 0 \qquad j = 1, 2, \dots, k-1 \tag{8.9.15}$$

thus showing $\{p_1, \dots, p_{k+1}\}$ is A-conjugate.

The conjugate gradient method consists of choosing $\{p_k\}$ from (8.9.12)-(8.9.14)
and $\{x_k, r_k\}$ from (8.9.11) and (8.9.9). Ignoring rounding errors, the
method converges in n or fewer iterations. The actual speed of convergence
varies a great deal with the eigenvalues of A. The error analysis of the CG
method is based on the following optimality result.

Theorem 8.9   The iterates $\{x_k\}$ of the CG method satisfy

$$\|x^* - x_k\|_A = \min_{\deg(q) < k} \|x^* - q(A) b\|_A \tag{8.9.16}$$

Proof   For q(λ) a polynomial, the notation q(A) denotes the matrix expression
with each power $\lambda^j$ replaced by $A^j$. For example,

$$q(\lambda) = a_0 + a_1 \lambda + a_2 \lambda^2 \quad \Longrightarrow \quad q(A) = a_0 I + a_1 A + a_2 A^2$$

The proof of (8.9.16) is given in Luenberger (1984, p. 246).


Using this theorem, a number of error results can be given, varying with the
properties of the matrix A. For example, let the eigenvalues of A be denoted by

$$0 < \lambda_1 \le \cdots \le \lambda_n \tag{8.9.17}$$

repeated according to their multiplicity, and let $v_1, \dots, v_n$ denote a corresponding
orthonormal basis of eigenvectors. Using this basis, write

$$x^* = \sum_{j=1}^{n} c_j v_j \qquad b = A x^* = \sum_{j=1}^{n} c_j \lambda_j v_j \tag{8.9.18}$$

Then

$$q(A) b = \sum_{j=1}^{n} c_j \lambda_j q(\lambda_j) v_j$$

$$\|x^* - q(A) b\|_A = \left[ \sum_{j=1}^{n} c_j^2 \lambda_j \bigl(1 - \lambda_j q(\lambda_j)\bigr)^2 \right]^{1/2} \tag{8.9.19}$$

Any choice of a polynomial q(λ) of degree < k will give a bound for $\|x^* - x_k\|_A$.
One of the better known bounds is

$$\|x^* - x_k\|_A \le 2 \left[ \frac{1 - \sqrt{c}}{1 + \sqrt{c}} \right]^k \|x^*\|_A \tag{8.9.20}$$

with $c = \lambda_1/\lambda_n$, the reciprocal of the condition number $\mathrm{cond}(A)_2$. This is a
conservative bound, implying poor convergence for ill-conditioned problems. Its
proof is sketched in Luenberger (1984, p. 258, prob. 10). Other bounds can be
derived, based on the behavior of the eigenvalues $\{\lambda_j\}$ and the coefficients $c_j$ of
(8.9.18). In many applications in which the $\lambda_j$ vary greatly, it often happens that
the $c_j$ for the smaller $\lambda_j$ are quite close to zero. Then the formula in (8.9.19) can
be manipulated to give an improved bound over that in (8.9.20). In other cases,
the eigenvalues may coalesce around a small number of values, and then (8.9.19)
can be used to show convergence with a small k. For other results, see Luenberger
(1984, p. 250), Jennings (1977), and van der Sluis and van der Vorst
(1986).

The formulas for $\{\alpha_j, \beta_j\}$ in defining the CG method can be further modified,
to give simpler and more efficient formulas. We incorporate those into the
following.


Algorithm CG (A, b, x, n)

1. Remark: This algorithm calculates the solution of Ax = b using
   the conjugate gradient method.

2. $x_0 := 0$, $r_0 := b$, $p_0 := 0$

3. For k = 0, ..., n − 1, do through step 7.

4. If $r_k = 0$, then set $x = x_k$ and exit.

5. For k = 0, $\beta_1 := 0$; and
   for k > 0, $\beta_{k+1} := r_k^T r_k / r_{k-1}^T r_{k-1}$
   $p_{k+1} := r_k + \beta_{k+1} p_k$

6. $\alpha_{k+1} := r_k^T r_k / p_{k+1}^T A p_{k+1}$
   $x_{k+1} := x_k + \alpha_{k+1} p_{k+1}$
   $r_{k+1} := b - A x_{k+1}$

7. End loop on k.

8. $x := x_n$ and exit.
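
The text does not derive the simplified coefficient formulas of steps 5 and 6. One possible derivation from (8.9.9), (8.9.11), (8.9.13), (8.9.14), and Lemma 1 is sketched below. It relies on two auxiliary facts that are assumptions of this sketch rather than statements from the text: $r_{k-1} = p_k - \beta_k p_{k-1}$ lies in $\mathcal{S}_k$, so that $r_k^T r_{k-1} = 0$ by Lemma 1; and $p_k$ is A-orthogonal to $x_{k-1} \in \mathcal{S}_{k-1}$.

```latex
\begin{aligned}
p_k^T b &= p_k^T (r_{k-1} + A x_{k-1}) = p_k^T r_{k-1}
         = (r_{k-1} + \beta_k p_{k-1})^T r_{k-1} = r_{k-1}^T r_{k-1} \\[4pt]
\alpha_k &= \frac{p_k^T b}{p_k^T A p_k} = \frac{r_{k-1}^T r_{k-1}}{p_k^T A p_k} \\[4pt]
A p_k &= \frac{r_{k-1} - r_k}{\alpha_k}
  \quad\Longrightarrow\quad
  p_k^T A r_k = \frac{r_k^T (r_{k-1} - r_k)}{\alpha_k} = -\frac{r_k^T r_k}{\alpha_k} \\[4pt]
\beta_{k+1} &= -\frac{p_k^T A r_k}{p_k^T A p_k}
  = \frac{r_k^T r_k}{\alpha_k \, p_k^T A p_k}
  = \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}}
\end{aligned}
```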


This algorithm does not consider the problems of using finite precision 
arithmetic. For a more complete algorithm, see Wilkinson and Reinsch (1971, 
pp. 57-69). 
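
A direct transcription of Algorithm CG into Python might look as follows. It is a minimal sketch, not a robust implementation: the exact test $r_k = 0$ of step 4 is replaced by a small tolerance `tol` (an assumption of this sketch), and the helper names are hypothetical. The data are the order five matrix (8.9.21) and the vectors b and x* of the example later in this section.

```python
def matvec(A, x):
    """Return the product A x, with A stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def cg(A, b, tol=1e-12):
    """Conjugate gradient method following steps 2-8 of Algorithm CG.

    Returns the final iterate and the list of max-norms of r_1, r_2, ...
    """
    n = len(b)
    x = [0.0] * n            # step 2: x_0 = 0
    r = list(b)              #         r_0 = b
    p = [0.0] * n            #         p_0 = 0
    rr_old = None
    history = []
    for k in range(n):       # step 3
        rr = dot(r, r)
        if rr ** 0.5 <= tol: # step 4, with a tolerance instead of r_k = 0
            break
        beta = 0.0 if k == 0 else rr / rr_old           # step 5
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                         # step 6
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        rr_old = rr
        history.append(max(abs(ri) for ri in r))
    return x, history

# Matrix (8.9.21) with b and x* from the example of this section
A = [[5.0, 4.0, 3.0, 2.0, 1.0],
     [4.0, 5.0, 4.0, 3.0, 2.0],
     [3.0, 4.0, 5.0, 4.0, 3.0],
     [2.0, 3.0, 4.0, 5.0, 4.0],
     [1.0, 2.0, 3.0, 4.0, 5.0]]
b = [7.9380, 12.9763, 17.3057, 19.4332, 18.4196]
x_star = [-0.3227, 0.3544, 1.1010, 1.5705, 1.6897]

x, history = cg(A, b)
print(x)
print(history)
```

Recomputing the residual as $b - A x_{k+1}$ in step 6 costs a second matrix-vector product per step. The update $r_{k+1} = r_k - \alpha_{k+1} A p_{k+1}$ from (8.9.11) is equivalent in exact arithmetic and reuses the product already formed.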

Our discussion of the CG method has followed closely that in Luenberger
(1984, chap. 8). For another approach, with more geometric motivation, see
Golub and Van Loan (1983, sec. 10.2). They also have extensive references to the
literature.


Example   As a simpleminded test case, we use the order five matrix

$$A = \begin{bmatrix} 5 & 4 & 3 & 2 & 1 \\ 4 & 5 & 4 & 3 & 2 \\ 3 & 4 & 5 & 4 & 3 \\ 2 & 3 & 4 & 5 & 4 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix} \tag{8.9.21}$$

The smallest and largest eigenvalues are $\lambda_1 = .5484$ and $\lambda_5 = 17.1778$, respectively.
For the error bound (8.9.20), c = .031925, and

$$\frac{1 - \sqrt{c}}{1 + \sqrt{c}} \doteq .697$$

For the linear system, we chose

$$b = [7.9380, 12.9763, 17.3057, 19.4332, 18.4196]^T$$

which leads to the true solution

$$x^* = [-0.3227, 0.3544, 1.1010, 1.5705, 1.6897]^T$$


Table 8.10 Example of the conjugate gradient method

k    ‖r_k‖_∞     ‖x* − x_k‖_∞    ‖x* − x_k‖_A    Bound (8.9.20)
1    4.27        8.05E-1         2.62            12.7
2    8.98E-2     7.09E-2         1.31E-1         8.83
3    2.75E-3     3.69E-3         4.78E-3         6.15
4    7.59E-5     1.38E-4         1.66E-4         4.29
5    ≈ 0         ≈ 0             ≈ 0             2.99




The results from using CG are shown in Table 8.10, along with the error bound 
in (8.9.20). As stated earlier, the bound (8.9.20) is very conservative. 
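
The last column of Table 8.10 can be recomputed from the quantities quoted in the example. The sketch below assumes only the rounded values of $\lambda_1$, $\lambda_5$, b, and x* given above, and uses the identity $\|x^*\|_A^2 = x^{*T} A x^* = x^{*T} b$. Because the inputs are rounded, the results agree with the table only to about three digits.

```python
import math

# Rounded values quoted in the example
lam_min, lam_max = 0.5484, 17.1778
b = [7.9380, 12.9763, 17.3057, 19.4332, 18.4196]
x_star = [-0.3227, 0.3544, 1.1010, 1.5705, 1.6897]

c = lam_min / lam_max                         # reciprocal of cond(A)_2
ratio = (1.0 - math.sqrt(c)) / (1.0 + math.sqrt(c))

# ||x*||_A^2 = x*^T A x* = x*^T b, since A x* = b
norm_A = math.sqrt(sum(xi * bi for xi, bi in zip(x_star, b)))

# Right side of (8.9.20) for k = 1, ..., 5
bounds = [2.0 * ratio ** k * norm_A for k in range(1, 6)]
print(c, ratio, norm_A)
print([round(v, 2) for v in bounds])
```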


The residuals decrease, as expected. But from the way the directions { p,} are 
constructed, this implies that obtaining accurate directions p, for larger k will 
likely be difficult because of the smaller number of digits of accuracy in the 
residuals r,. For some discussion of this, see Golub and Van Loan (1983, p. 373), 
which also contains additional references to the literature for this problem. 


The preconditioned conjugate gradient method   The bound (8.9.20) suggests
that the CG iterates can converge quite slowly, even for systems
with a moderate condition number such as $\mathrm{cond}(A)_2 = 1/c = 100$. To increase
the rate of convergence, or at least to guarantee a rapid rate of convergence, the
problem Ax = b is transformed to an equivalent problem with a smaller
condition number. The bound in (8.9.20) will be smaller, and one expects that the
iterates will converge more rapidly.

For a nonsingular matrix Q, transform Ax = b by

$$(Q^{-1} A Q^{-T})(Q^T x) = Q^{-1} b \tag{8.9.22}$$

with $Q^{-T} = (Q^T)^{-1}$. Write

$$\tilde{A} = Q^{-1} A Q^{-T} \qquad \tilde{x} = Q^T x \qquad \tilde{b} = Q^{-1} b \tag{8.9.23}$$

Then (8.9.22) is simply $\tilde{A}\tilde{x} = \tilde{b}$. The matrix Q is to be chosen so that $\mathrm{cond}(\tilde{A})_2$ is
significantly smaller than $\mathrm{cond}(A)_2$. The actual CG method is not applied
explicitly to solving $\tilde{A}\tilde{x} = \tilde{b}$, but rather the algorithm CG is modified slightly.
For the resulting algorithm when Q is symmetric, see Golub and Van Loan
(1983, p. 374).

Finding Q requires a careful analysis of the original problem Ax = b,
understanding the structure of A in order to pick Q. From (8.9.23),

$$A = Q \tilde{A} Q^T$$

with $\tilde{A}$ to be chosen with eigenvalues near 1 in magnitude. For example, if $\tilde{A}$ is
about the identity I, then $A \doteq QQ^T$. This decomposition could be accomplished
with a Cholesky triangular factorization. Approximate Cholesky factors are used
in defining preconditioners in some cases. For an introduction to the problem of
selecting preconditioners, see Golub and Van Loan (1983, sec. 10.3) and Axelsson
(1985).
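
As a small illustration of the transformation (8.9.23), the sketch below uses diagonal scaling, $Q = \mathrm{diag}(\sqrt{a_{ii}})$. This choice of Q is an assumption made for the example only; the text does not prescribe it. The 2 × 2 matrix and the function names are also hypothetical. For a symmetric 2 × 2 matrix the eigenvalues follow from the quadratic formula, so $\mathrm{cond}(\cdot)_2$ can be computed without an eigenvalue routine.

```python
import math

def cond2_sym2x2(M):
    """cond(M)_2 for a symmetric positive definite 2 x 2 matrix."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    mean = 0.5 * (a + d)
    rad = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return (mean + rad) / (mean - rad)   # largest over smallest eigenvalue

def diagonal_scaling(A):
    """Return Q^{-1} A Q^{-T} for the illustrative choice Q = diag(sqrt(a_ii))."""
    n = len(A)
    q = [math.sqrt(A[i][i]) for i in range(n)]
    return [[A[i][j] / (q[i] * q[j]) for j in range(n)] for i in range(n)]

# Hypothetical badly scaled symmetric positive definite matrix
A = [[100.0, 1.0], [1.0, 1.0]]
A_tilde = diagonal_scaling(A)

print(cond2_sym2x2(A))        # about 101
print(cond2_sym2x2(A_tilde))  # 1.1 / 0.9, about 1.22
```

Here the transformed matrix has unit diagonal and eigenvalues 0.9 and 1.1, so the factor in (8.9.20) is much smaller for the transformed system than for the original one.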


Discussion of the Literature 


The references that have most influenced the presentation of Gaussian elimination
and other topics in this chapter are the texts of Forsythe and Moler (1967),
Golub and Van Loan (1983), Isaacson and Keller (1966), and Wilkinson (1963),
(1965), along with the paper of Kahan (1966). Other very good general treatments
are given in Conte and de Boor (1980), Noble (1969), Rice (1981), and Stewart 
(1973a). More elementary introductions are given in Anton ae, and Strang 
(1980). 

The best codes for the direct solution of both general and special forms of
linear systems, of small to moderate size, are based on those given in the package 
LINPACK, described in Dongarra et al. (1979). These are completely portable 
programs, and they are available in single and double precision, in both real and 
complex arithmetic. Along with the solution of the systems, they also can 
estimate the condition number of the matrix under consideration. The linear 
equation programs in IMSL and NAG are variants and improvements of the 
programs in LINPACK. 

Another feature of LINPACK is the use of the Basic Linear Algebra
Subroutines (BLAS). These are low-level subprograms that carry out basic vector 
operations, such as the dot product of two vectors and the sum of two vectors. 
These are available in Fortran, as part of LINPACK; but by giving assembly 
language implementations of them, it is often possible to significantly improve 
the efficiency of the main LINPACK programs. For a more general discussion of 
the BLAS, see Lawson et al. (1979). The LINPACK programs are widely
available, and they have greatly influenced the development of linear equation
programs in other packages.

There is a very large literature on solving the linear systems arising from the 
numerical solution of partial differential equations (PDEs). For some general 
texts on the numerical solution of PDEs, see Birkhoff and Lynch (1984), Forsythe 
and Wasow (1960), Gladwell and Wait (1979), Lapidus and Pinder (1982), and 
Richtmyer and Morton (1967). For texts devoted to classical iterative methods
for solving the linear systems arising from the numerical solution of PDEs, see
Hageman and Young (1981) and Varga (1962). For other approaches of more
recent interest, see Swarztrauber (1984), Swarztrauber and Sweet (1979), George 
and Liu (1981), and Hackbusch and Trottenberg (1982). 

The numerical solution of PDEs is the source of a large percentage of the 
sparse linear systems that are solved in practice. However, sparse systems of large 
order also occur with other applications [e.g., see Duff (1981)]. There is a large 
variety of approaches to solving large sparse systems, some of which we discussed
in Sections 8.6-8.8. Other direct and iteration methods are available, depending
on the structure of the matrix. For a sample of the current research in this very
active area, see the survey of Duff (1977), the proceedings of Björck et al. (1981),
Duff (1981), Duff and Stewart (1979), and Evans (1985), and the texts of George 
and Liu (1981) and Pissanetzky (1984). There are several software packages for 
the solution of various types of sparse systems, some associated with the 
preceding books. For a general index of many of the packages that are available, 
see the compilation of Heath (1982). For iteration methods for the systems 
associated with solving partial differential equations, the books of Varga (1962) 
and Hageman and Young (1981) discuss many of the classical approaches. 

Integral equations lead to dense linear systems, and other types of iteration 
methods have been used for their solution. For some quite successful methods, 
see Atkinson (1976, part II, chap. 4).

The conjugate gradient method dates to Hestenes and Stiefel (1952), and its
use in solving integral and partial differential equations is still under develop- 
ment. For more extensive discussions relating the conjugate direction method to 
other numerical methods, see Hestenes (1980) and Stewart (1973b). For refer- 
ences to the recent literature, including discussions of the preconditioned con- 
jugate gradient method, see Axelsson (1985), Axelsson and Lindskog (1986), and 
Golub and Van Loan (1983, secs. 10.2 and 10.3). A generalization for nonsym- 
metric systems is proposed in Eisenstat et al. (1983). 

One of the most important forces that will be determining the direction of 
future research in numerical linear algebra is the growing use of vector and 
parallel processor computers. The vector machines, such as the CRAY-2, work 
best when doing basic operations on vector quantities, such as those specified in 
the BLAS used in LINPACK. In recent years, there has been a vast increase in 
the availability of time on these machines, on newly developed nationwide 
computer networks. This has changed the scale of many physical problems that 
can be attempted, and it has led to a demand for ever more efficient computer 
programs for solving a wide variety of linear systems. The use of parallel 
computers is even more recent, and only in the middle to late 1980s have they 
become widespread. There is a wide variety of architectures for such machines. 
Some have the multiple processors share a common memory, with a variety of 
possible designs; others are based on each processor having its own memory and 
being linked in various ways to other processors. Parallel computers often lead to 
quite different types of numerical algorithms than those we have been studying 
for sequential computers, algorithms that can take advantage of several concur- 
rent processors working on a problem. There is little literature available, although 
that is changing quite rapidly. As a survey of the solution of the linear systems 
associated with partial differential equations, on both vector and parallel com- 
puters, see Ortega and Voigt (1985). For a proposed text for the solution of linear 
systems, see Ortega (1987).


Problems


1. Solve the following systems Ax = b by Gaussian elimination without
pivoting. Check that A = LU, as in (8.1.5).


11 -1 1 
(ay) A={ 12 -2 b=1|0 


2: a! 1 
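(b) $$A = \begin{bmatrix} 4 & 3 & 2 & 1 \\ 3 & 4 & 3 & 2 \\ 2 & 3 & 4 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix} \qquad b = \begin{bmatrix} 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$$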
1 1 1 -1 0 
_{-1 3 -3 3 ei 2 
@ A=| 4 4 ek b=} _5 
-3 7 -10 14 8 


2. Consider the linear system

$$\begin{aligned} 6x_1 + 2x_2 + 2x_3 &= -2 \\ 2x_1 + \tfrac{2}{3}x_2 + \tfrac{1}{3}x_3 &= 1 \\ x_1 + 2x_2 - x_3 &= 0 \end{aligned}$$

and verify its solution is

$$x_1 = 2.6 \qquad x_2 = -3.8 \qquad x_3 = -5.0$$

(a) Using four-digit floating-point decimal arithmetic with rounding, solve
the preceding system by Gaussian elimination without pivoting.

(b) Repeat part (a), using partial pivoting. In performing the arithmetic
operations, remember to round to four significant digits after each
operation, just as would be done on a computer.


3. (a) Implement the algorithms Factor and Solve of Section 8.2, or implement
the analogous programs given in Forsythe and Moler (1967,
chaps. 16 and 17).

(b) To test the program, solve the system Ax = b of order n, with
$A = [a_{ij}]$ defined by

$$a_{ij} = \mathrm{Max}\{i, j\}$$

Also define $b = [1, 1, \dots, 1]^T$. The true solution is
$x = [0, 0, \dots, 0, (1/n)]^T$. This matrix is taken from Gregory and Karney
(1969, p. 42).

4. Consider solving the integral equation

$$\lambda x(s) - \int_0^1 \cos(\pi s t)\, x(t)\, dt = 1 \qquad 0 \le s \le 1$$

by discretizing the integral with the midpoint numerical integration rule
(5.2.18). More precisely, let n > 0, h = 1/n, $t_i = (i - \tfrac{1}{2})h$ for i = 1, ..., n.
We solve for approximate values of $x(t_1), \dots, x(t_n)$ by solving the linear
system

$$\lambda z_i - \sum_{j=1}^{n} h \cos(\pi t_i t_j)\, z_j = 1 \qquad i = 1, \dots, n$$

Denote this linear system by $(\lambda I - K_n) z = b$, with $K_n$ of order n × n,

$$(K_n)_{ij} = h \cos(\pi t_i t_j) \qquad b_i = 1 \qquad 1 \le i, j \le n$$

For sufficiently large n, $z_i \doteq x(t_i)$, $1 \le i \le n$. The value of λ is nonzero,
and it is assumed here to not be an eigenvalue of $K_n$.

Solve $(\lambda I - K_n) z = b$ for several values of n, say n = 2, 4, 8, 16, 32, 64,
and print the vector solutions z. If possible, also graph these solutions, to
gain some idea of the solution function x(s) of the original integral
equation. Use λ = 4, 2, 1, .5.


5. (a) Consider solving Ax = b, with A and b complex and order(A) = n.
Convert this problem to that of solving a real square system of order
2n. Hint: Write $A = A_1 + iA_2$, $b = b_1 + ib_2$, $x = x_1 + ix_2$, with
$A_1, A_2, b_1, b_2, x_1, x_2$ all real. Determine equations to be satisfied by
$x_1$ and $x_2$.

(b) Determine the storage requirements and the number of operations for
the method in (a) of solving the complex system Ax = b. Compare
these results with those based on directly solving Ax = b using
Gaussian elimination and complex arithmetic. Note the greater
expense of complex arithmetic operations.


6. Let A, B, C be matrices of orders m × n, n × p, p × q, respectively. Do an
operations count for computing A(BC) and (AB)C. Give examples of
when one order of computation is preferable over the other.


7. (a) Show that the number of multiplications and divisions for the
Gauss-Jordan method of Section 8.3 is about $\tfrac{1}{2}n^3$.

(b) Show how the Gauss-Jordan method, with partial pivoting, can be
used to invert an n × n matrix within only n(n + 1) storage
locations. Can complete pivoting be used?


8. Use either the programs of Problem 3(a) or the Gauss-Jordan method to
invert the matrices in Problems 1 and 3(b).


9. Prove that if $A = LL^T$ with L real and nonsingular, then A is symmetric
and positive definite.


10. Using the Choleski method, calculate the decomposition $A = LL^T$ for
15 -18 15 -3 


225-30 45 
@ |-30 so -100} @ {~j8 % —iR 8 
45 -100 340 Dy eee 


11. Let A be nonsingular. Let A = LU = LDM, with all $l_{ii}, m_{ii} = 1$, and D
diagonal. Further assume A is symmetric. Show that $M = L^T$, and thus
$A = LDL^T$. Show A is positive definite if and only if all $d_{ii} > 0$.


12. Let A be real, symmetric, positive definite, and of order n. Consider solving
Ax = b using Gaussian elimination without pivoting. The purpose of this
problem is to justify that the pivots will be nonzero.

(a) Show that all of the diagonal elements satisfy $a_{ii} > 0$. This shows that
$a_{11}$ can be used as a pivot element.

(b) After elimination of $x_1$ from equations 2 through n, let the resulting
matrix $A^{(2)}$ be written as

$$A^{(2)} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ 0 & & & \\ \vdots & & \hat{A}^{(2)} & \\ 0 & & & \end{bmatrix}$$

Show that $\hat{A}^{(2)}$ is symmetric and positive definite.

This procedure can be continued inductively to each stage of the
elimination process, thus justifying the existence of nonzero pivots at every
step. Hint: To prove $\hat{A}^{(2)}$ is positive definite, first prove the identity

$$\sum_{i,j=2}^{n} a_{ij}^{(2)} x_i x_j = \sum_{i,j=1}^{n} a_{ij} x_i x_j - a_{11} \left[ x_1 + \sum_{j=2}^{n} \frac{a_{1j}}{a_{11}} x_j \right]^2$$

for any choice of $x_1, x_2, \dots, x_n$. Then choose $x_1$ suitably.

13. As another approach to developing a compact method for producing the
LU factorization of A, consider the following matrix-oriented approach.
Write

$$A = \begin{bmatrix} \hat{A} & d \\ c^T & \alpha \end{bmatrix} \qquad c, d \in R^{n-1} \quad \alpha \in R$$

and $\hat{A}$ square of order n − 1. Assume A is nonsingular. As a step in an
induction process, assume $\hat{A} = \hat{L}\hat{U}$ is known, with $\hat{A}$ nonsingular. Look
for A = LU in the form

$$A = \begin{bmatrix} \hat{L} & 0 \\ m^T & 1 \end{bmatrix} \begin{bmatrix} \hat{U} & q \\ 0 & \gamma \end{bmatrix} \qquad m, q \in R^{n-1} \quad \gamma \in R$$

Show that m, q, and γ can be found, and describe how to do so. (This
method is applied to an original A, factoring each principal submatrix in
the upper left corner, in increasing order.)


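14. Using the algorithm (8.3.23)-(8.3.24) for solving tridiagonal systems, solve
Ax = b with

$$A = \begin{bmatrix} 2 & -1 & 0 & 0 & 0 \\ 1 & 2 & -1 & 0 & 0 \\ 0 & 1 & 2 & -1 & 0 \\ 0 & 0 & 1 & 2 & -1 \\ 0 & 0 & 0 & 1 & 2 \end{bmatrix} \qquad b = \begin{bmatrix} 3 \\ -2 \\ 2 \\ -2 \\ 1 \end{bmatrix}$$

Check that the hypotheses and conclusions of Theorem 8.2 are satisfied by
this example.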


15. Define the order n tridiagonal matrix

$$A_n = \begin{bmatrix} 2 & -1 & 0 & \cdots & 0 \\ -1 & 2 & -1 & & \vdots \\ 0 & -1 & 2 & \ddots & 0 \\ \vdots & & \ddots & \ddots & -1 \\ 0 & \cdots & 0 & -1 & 2 \end{bmatrix}$$

Find a general formula for $A_n = LU$. Hint: Consider the cases n = 3, 4, 5,
and then guess the general pattern and verify it.


16. Write a subroutine to solve tridiagonal systems using (8.3.23)-(8.3.24).
Check it using the examples in Problems 14 and 15. There are also a
number of tridiagonal systems in Gregory and Karney (1969, chap. 2) for
which the true inverses are known.


17. There are families of linear systems $A_k x = b$ in which $A_k$ changes in some
simple way into a matrix $A_{k+1}$, and it may then be simpler to find the LU
factorization of $A_{k+1}$ by modifying that of $A_k$. As an example that arises in
the simplex method for linear programming, let $A_1 = [a_1, \dots, a_n]$ and
$A_2 = [a_2, \dots, a_{n+1}]$, with all $a_j \in R^n$. Suppose $A_1 = L_1 U_1$ is known, with
$L_1$ lower triangular and $U_1$ upper triangular. Find a simple way to obtain
the LU factorization $A_2 = L_2 U_2$ from that for $A_1$, assuming pivoting is not
needed. Hint: Using $L_1 u_i = a_i$, $1 \le i \le n$, write

$$A_2 = L_1 [u_2, u_3, \dots, u_n, L_1^{-1} a_{n+1}] \equiv L_1 \hat{U}$$

Show that $\hat{U}$ can be simply modified into an upper triangular form $U_2$, and
that this corresponds to the conversion of $L_1$ into the desired $L_2$. More
precisely, $U_2 = M\hat{U}$, $L_2 = L_1 M^{-1}$. Show that the operation cost for
obtaining $A_2 = L_2 U_2$ is $O(n^2)$.


18. (a) Calculate the condition numbers $\mathrm{cond}(A)_p$, p = 1, 2, ∞, for


(b) Find the eigenvalues and eigenvectors of A, and use them to illustrate 
the remarks following (8.4.8) in Section 8.4. 


19. Prove that if A is unitary, then cond(A), =


20. Show that for every A, the upper bound in (8.4.4) can be attained for
suitable choices of b and r. Hint: From the definitions of $\|A\|$ and $\|A^{-1}\|$
in Section 7.3, there are vectors $\hat{x}$ and $\hat{r}$ for which $\|A\hat{x}\| = \|A\| \|\hat{x}\|$ and
$\|A^{-1}\hat{r}\| = \|A^{-1}\| \|\hat{r}\|$. Use this to complete the construction of equality in
the upper bound of (8.4.4).


21. The condition number $\mathrm{cond}(A)_*$ of (8.4.6) can be quite small for matrices
A that are ill-conditioned. To see this, define the n × n matrix

$$A_n = \begin{bmatrix} 1 & -1 & -1 & \cdots & -1 \\ 0 & 1 & -1 & \cdots & -1 \\ \vdots & & \ddots & \ddots & \vdots \\ & & & 1 & -1 \\ 0 & \cdots & & 0 & 1 \end{bmatrix}$$

Easily $\mathrm{cond}(A_n)_* = 1$. Verify that $A_n^{-1}$ is given by the upper triangular
matrix $B = [b_{ij}]$, with $b_{ii} = 1$,

$$b_{ij} = 2^{j-i-1} \qquad i < j \le n$$

Compute $\mathrm{cond}(A_n)_\infty$.


22. As in Section 8.4, let $H_n = [1/(i + j - 1)]$ denote the Hilbert matrix of
order n, and let $\hat{H}_n$ denote the matrix obtained when $H_n$ is entered into
your computer in single precision arithmetic. To compare $H_n^{-1}$ and $\hat{H}_n^{-1}$,
convert $\hat{H}_n$ to a double precision matrix by appending additional zeros to
the mantissa of each entry. Then use a double precision matrix inversion
computer program to calculate $\hat{H}_n^{-1}$ numerically. This will give an accurate
value of $\hat{H}_n^{-1}$ to single precision accuracy, for lower values of n. After
obtaining $\hat{H}_n^{-1}$, compare it with $H_n^{-1}$, given in (8.4.13) or Gregory and
Karney (1969, pp. 34-37).


23. Using the programs of Problem 3 or the LINPACK programs SGECO and
SGESL, solve $\hat{H}_n x = b$ for several values of n. Use $b = [1, -1, 1, -1, \dots]^T$,
and calculate the true answer by using $\hat{H}_n^{-1}$ from Problem 22. Comment on
your results.


24. Using the residual correction method, described at the beginning of Section
8.5, calculate accurate single precision answers to the linear systems of
Problem 23. Print the residuals and corrections, and examine the rate of
decrease in the correction terms as the order n is increased. Attempt to
explain your results.


25. Consider the linear system of Problem 4, for solving approximately an
integral equation. Occasionally we want to solve such a system for several
values of λ that are close together. Write a program to first solve the system
for $\lambda_0 = 4.0$, and then save the LU decomposition of $\lambda_0 I - K$. To solve
$(\lambda I - K)x = b$ with other values of λ near $\lambda_0$, use the residual correction
method (8.5.3) with $C = [LU]^{-1}$. For example, solve the system when
λ = 4.1, 4.5, 5, and 10. In each case, print the iterates and calculate the
ratio in (8.5.11). Comment on the behavior of the iterates as λ increases.


26. The system Ax = b,

$$A = \begin{bmatrix} 4 & -1 & 0 & -1 & 0 & 0 \\ -1 & 4 & -1 & 0 & -1 & 0 \\ 0 & -1 & 4 & 0 & 0 & -1 \\ -1 & 0 & 0 & 4 & -1 & 0 \\ 0 & -1 & 0 & -1 & 4 & -1 \\ 0 & 0 & -1 & 0 & -1 & 4 \end{bmatrix} \qquad b = \begin{bmatrix} 2 \\ 1 \\ 2 \\ 2 \\ 1 \\ 2 \end{bmatrix}$$

has the solution $x = [1, 1, 1, 1, 1, 1]^T$. Solve the system using the
Gauss-Jacobi iteration method, and then solve it again using the
Gauss-Seidel method. Use the initial guess x = 0. Note the rate at which
the iteration error decreases. Find the answers with an accuracy of ε = .0001.


27. Let A and B have order n, with A nonsingular. Consider solving the linear
system

$$A z_1 + B z_2 = b_1 \qquad B z_1 + A z_2 = b_2$$

with $z_1, z_2, b_1, b_2 \in R^n$.

(a) Find necessary and sufficient conditions for convergence of the
iteration method

$$A z_1^{(m+1)} = b_1 - B z_2^{(m)} \qquad A z_2^{(m+1)} = b_2 - B z_1^{(m)} \qquad m \ge 0$$

(b) Repeat part (a) for the iteration method

$$A z_1^{(m+1)} = b_1 - B z_2^{(m)} \qquad A z_2^{(m+1)} = b_2 - B z_1^{(m+1)} \qquad m \ge 0$$

Compare the convergence rates of the two methods.


28. For the error equation (8.6.25), show that $r_\sigma(M) < 1$ if

$$\|P\| < \frac{1}{\|N^{-1}\|}$$

for some matrix norm.


29. For the iteration of a block tridiagonal system, given in (8.6.30), show
convergence under the assumptions that


I : 1 
Ad + ICii< >, 2sisr-8 IAll< 2a 


lei < ie B= 


I . 
Br il 
Bound the rate of convergence. 


30. Recall the matrix $A_n$ of Problem 15, and consider the linear system
$A_n x = b$. This system is important as it arises in the standard finite
difference approximation (6.11.30) to the two-point boundary value problem


W(x)=f(ny(x)) a<x<B ya)l=ay p(B) =a, 


It is also important because it arises in the analysis of iterative methods for
solving discretizations of Poisson's equation, as in (8.8.5). In line with this,
consider using Jacobi's method to solve $A_n x = b$ iteratively. Show that
Jacobi's method converges by showing $r_\sigma(M) < 1$ for the appropriate
matrix M. Hint: Use the results of Problem 6 of Chapter 7.


31. As an example that convergent iteration methods can behave in unusual
ways, consider

$$x^{(k+1)} = b + A x^{(k)} \qquad k \ge 0$$
with 


_|A _ 2 


Assuming |λ| < 1, we have that $(I - A)^{-1}$ exists and $x^{(k)} \to x^* = (I - A)^{-1} b$
for all initial guesses $x^{(0)}$. Find explicit formulas for $A^k$, $x^* - x^{(k)}$, and
$x^{(k+1)} - x^{(k)}$. By suitably adjusting c relative to λ, show that it is possible
for $\|x^* - x^{(k)}\|_\infty$ to alternately increase and decrease as it converges to
zero. Look at the corresponding values for $\|x^{(k+1)} - x^{(k)}\|_\infty$. For simplicity,
use $x^{(0)} = 0$ in all calculations.


32. (a) Let $C_0$ be an approximate inverse to A. Define $R_0 = I - AC_0$, and
assume $\|R_0\| < 1$ for some matrix norm. Define the iteration method

$$C_{m+1} = C_m (I + R_m) \qquad R_{m+1} = I - A C_{m+1} \qquad m \ge 0$$

This is a well-known iteration method for calculating the inverse $A^{-1}$.
Show the convergence of $C_m$ to $A^{-1}$ by first relating the error
$A^{-1} - C_m$ to the residual $R_m$. Then examine the behavior of the
residual $R_m$ by showing that $R_{m+1} = R_m^2$, $m \ge 0$.

(b) Relate $C_m$ to the expansion

$$A^{-1} = C_0 (I - R_0)^{-1} = C_0 \sum_{j=0}^{\infty} R_0^j$$

Observe the relation of this method for inverting A to the iteration
method (2.0.6) of Chapter 2 for calculating 1/a for nonzero numbers
a. Also, see Problem 1 of Chapter 2.


33. Implement programs for iteratively solving the discretization (8.8.5) of
Poisson's equation on the unit square. To have a situation for which you
have a true solution of the linear system, choose Poisson equations in which
there is no discretization error in going to (8.8.5). This will be true if the
truncation errors in (8.8.3) are identically zero, as, for example, with

u(x, y) = x*y?, 


(a) Solve (8.8.5) with Jacobi's method. Observe the actual error
$\|x - x^{(k)}\|_\infty$ in each iterate, as well as $\|x^{(k+1)} - x^{(k)}\|_\infty$. Estimate the
constant c of (8.7.2), and compute the estimated error bound of
(8.7.5). Compare with the true iteration error.

(b) Repeat with the Gauss-Seidel method. Also compare the iteration
rate c with that predicted by (8.8.13).

(c) Implement the SOR method, using the optimal acceleration parameter
ω* from (8.8.15).


(a) Generalize the discretization of the Poisson equation in (8.8.1) to 
the equation 


d7u Au 
Oxf aye Sealy) O<x,y<1 


with u = f(x, y) on the boundary as before. 


(b) Assume c(x, y)20 for O< x, y <1. Generalize part (1) of the 
proof of Theorem 8.8 to show that the linear system of part (a) will 
have a unique solution. 


Implement the conjugate gradient algorithm CG of Section 8.9. Test it with 
the systems of Problems 1, 3, and 4. Whenever possible, for testing 
purposes, uSe systems with a known true solution. Using it, compute the 
true errors in each iterate and see how rapidly they decrease. For the linear 
system in Problem 4, that is based on solving an integral equation, solve the 
system for several values of n. Comment on the results. 


Recall the vector norm ||x||, of (8.9.7), with A symmetric and positive 
definite. Let the eigenvalues of A be denoted by 


0<A,<A,<-:: <A, 
Show that 


Vr ile < ltl S ~Anllxll 


with both equalities attainable for suitable choices of x. Hint: Use an 
orthonormal basis of eigenvectors of A. 


Prove Lemma 1, following (8.9.11). Hint: Use mathematical induction on 
k. Prove it for k = 1. Then assume it is true for k < I, and prove it for 
k = 1+ 1, Break the proof into two parts: (1) p77). = 0 for i < J, and (2) 


T a 
Pisili+1 = 9. 


Let A be symmetric, positive definite, and order‘n Xn. Let U= 
{u,,..-,u,} be a set of nonzero vectors in R". Then if U is both an 
orthogonal set and an A-orthogonal set, then Au; = A,u,;, i= 1,...,n for 
suitable A, > 0. Conversely, one can always choose a set of eigenvectors 
{u,,..., u,,} of A to have them be both orthogonal and A-orthogonal. 


39. Let A be symmetric, positive definite, and of order n. Let $\{v_1, \ldots, v_n\}$ be
an A-orthogonal set in $R^n$, with all $v_i \ne 0$. Define

$$Q_j = \frac{v_j v_j^T A}{v_j^T A v_j}, \qquad j = 1, \ldots, n$$

Show the following properties for $Q_j$.

1. $Q_j v_i = 0$ if $i \ne j$; and $Q_j v_j = v_j$.

2. $Q_j^2 = Q_j$.

3. $(x, Q_j y)_A = (Q_j x, y)_A$, for all $x, y \in R^n$.

4. $(Q_j x, (I - Q_j) y)_A = 0$, for all $x, y \in R^n$.

5. $\|x\|_A^2 = \|Q_j x\|_A^2 + \|(I - Q_j) x\|_A^2$, for all $x \in R^n$.


Properties (2)-(5) say that $Q_j$ is an orthogonal projection on the vector
space $R^n$ with the inner product $(\cdot, \cdot)_A$. Define

$$S_k = \mathrm{Span}\,\{v_1, \ldots, v_k\}$$

Show that the solution to the minimization problem

$$\min_{y \in S_k} \|x - y\|_A$$

is given by

$$y = P_k x, \qquad P_k = \sum_{j=1}^{k} Q_j$$

The matrix $P_k$ also satisfies properties (2)-(5).


NINE 


THE MATRIX 
EIGENVALUE PROBLEM 


We study the problem of calculating the eigenvalues and eigenvectors of a square 
matrix. This problem occurs in a number of contexts and the resulting matrices 
may take a variety of forms. These matrices may be sparse or dense, may have 
greatly varying order and structure, and often are symmetric. In addition, what is 
to be calculated can vary enough to affect the choice of method to be used. If
only a few eigenvalues are to be calculated, then the numerical method will be
different than if all eigenvalues are required.

The general problem of finding all eigenvalues and eigenvectors of a nonsym- 
metric matrix A can be quite unstable with respect to perturbations in the 
coefficients of A, and this makes more difficult the design of general methods and 
computer programs. The eigenvalues of a symmetric matrix A are quite stable 
with respect to perturbations in A. This is investigated in Section 9.1, along with 
the possible instability for nonsymmetric matrices. Because of the greater stabil- 
ity of the eigenvalue problem for symmetric matrices and because of its common 
occurrence, many methods have been developed especially for it. This will be a 
major emphasis of the development of this chapter, although methods for the 
nonsymmetric matrix eigenvalue problem are also discussed. 

The eigenvalues of a matrix are usually calculated first, and they are used in
calculating the eigenvectors, if these are desired. The main exception to this rule 
is the power method described in Section 9.2, which is useful in calculating a 
single dominant eigenvalue of a matrix. The usual procedure for calculating the 
eigenvalues of a matrix A is two-stage. First, similarity transformations are used 
to reduce A to a simpler form, which is usually tridiagonal for symmetric 
matrices. And second, this simpler matrix is used to calculate the eigenvalues,
and also the eigenvectors if they are required. The main form of similarity 
transformations used are certain special unitary or orthogonal matrices, which 
are discussed in Section 9.3. For the calculation of the eigenvalues of a symmetric 
tridiagonal matrix, the theory of Sturm sequences is introduced in Section 9.4 
and the QR algorithm is discussed in Section 9.5. Once the eigenvalues have been
calculated, the most powerful technique for calculating the eigenvectors is the 
method of inverse iteration. It is discussed and illustrated in Section 9.6. It should 
be noted that we will be using the words symmetric and nonsymmetric quite 
generally, where they ordinarily should be used only in connection with real 
matrices. For complex matrices, always substitute Hermitian and non-Hermitian, 
respectively.

Most numerical methods used at present have been developed since 1950.
They are nontrivial to implement as computer programs, especially those that are
to be used for nonsymmetric matrices. Beginning in the mid-1960s, algorithms for
a variety of matrix eigenvalue problems were published, in ALGOL, in the
journal Numerische Mathematik. These were tested extensively, and were subsequently
revised based on the tests and on new theoretical results. These algorithms
have been collected together in Wilkinson and Reinsch (1971, part II).
A project within the Applied Mathematics Division of the Argonne National 
Laboratory translated these programs into Fortran, and further testing and 
improvement was carried out. This package of programs is called EISPACK and 
it is available from the Argonne National Laboratory and other sources (see the 
appendix). A complete description of the package, including all programs, is 
given in Smith et al. (1976) and Garbow et al. (1977). 


9.1 Eigenvalue Location, Error, and Stability Results 


We begin by giving some results for locating and bounding the eigenvalues of a 
matrix A. As a crude upper bound, recall from Theorem 7.8 of Chapter 7 that 


$$\max_{\lambda \in \sigma(A)} |\lambda| \le \|A\| \tag{9.1.1}$$


for any matrix norm. The notation $\sigma(A)$ denotes the set of all eigenvalues of A.
The next result is a simple computational technique for giving better estimates 
for the location of the eigenvalues of A. 

For $A = [a_{ij}]$ of order n, define

$$r_i = \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|, \qquad i = 1, 2, \ldots, n \tag{9.1.2}$$

and let $Z_i$ denote the circle in the complex plane with center $a_{ii}$ and radius $r_i$:

$$Z_i = \{ z \in C : |z - a_{ii}| \le r_i \} \tag{9.1.3}$$


Theorem 9.1 (Gerschgorin) Let A have order n and let $\lambda$ be an eigenvalue of
A. Then $\lambda$ belongs to one of the circles $Z_i$. Moreover, if m of the
circles form a connected set S, disjoint from the remaining n - m
circles, then S contains exactly m of the eigenvalues of A, counted
according to their multiplicity as roots of the characteristic
polynomial of A.
Since A and $A^T$ have the same eigenvalues and characteristic
polynomial, these results are also valid if summation within the
column, rather than in the row, is used in defining the radii in
(9.1.2).

Figure 9.1 Example of Gerschgorin circle theorem.


Proof Figure 9.1 gives a picture in the complex plane of what the circles might 
look like for a complex matrix of order three. The solid circles are the 
ones given by (9.1.3), and the dotted ones occur later in the proof. 
According to the theorem, there should be one eigenvalue in Z,, and two 
eigenvalues in the union of Z, and Z,. 

Let $\lambda$ be an eigenvalue of A, and let x be a corresponding eigenvector.
Let k be the subscript of a component of x for which

$$|x_k| = \max_{1 \le i \le n} |x_i| = \|x\|_\infty$$

Then from $Ax = \lambda x$, the kth component yields

$$\sum_{j=1}^{n} a_{kj} x_j = \lambda x_k$$

$$(\lambda - a_{kk})\, x_k = \sum_{\substack{j=1 \\ j \ne k}}^{n} a_{kj} x_j$$

$$|\lambda - a_{kk}|\,|x_k| \le \sum_{\substack{j=1 \\ j \ne k}}^{n} |a_{kj}|\,|x_j| \le r_k \|x\|_\infty$$

Canceling $\|x\|_\infty$ proves the first part of the theorem.
Define

$$D = \mathrm{diag}\,[a_{11}, a_{22}, \ldots, a_{nn}], \qquad E = A - D$$

For $0 \le \epsilon \le 1$, define

$$A(\epsilon) = D + \epsilon E \tag{9.1.4}$$

and denote its eigenvalues by $\lambda_1(\epsilon), \ldots, \lambda_n(\epsilon)$. Note that A(1) = A, the
original matrix. The eigenvalues are the roots of the characteristic
polynomial

$$f_\epsilon(\lambda) = \det\,[A(\epsilon) - \lambda I]$$

Since the coefficients of $f_\epsilon(\lambda)$ are continuous functions of $\epsilon$, and since
the roots of any polynomial are continuous functions of its coefficients
[see Henrici (1974) p. 281], we have that $\lambda_1(\epsilon), \ldots, \lambda_n(\epsilon)$ are continuous
functions of $\epsilon$. As the parameter $\epsilon$ changes, each eigenvalue $\lambda_i(\epsilon)$ will
vary in the complex plane, marking out a path from $\lambda_i(0)$ to $\lambda_i(1)$.

From the first part of the theorem, we know the eigenvalues $\lambda_i(\epsilon)$ are
contained in the circles

$$Z_i(\epsilon) = \{ z \in C : |z - a_{ii}| \le \epsilon r_i \}, \qquad i = 1, \ldots, n \tag{9.1.5}$$

with $r_i$ defined as before in (9.1.2). Examples of these circles are given in
Figure 9.1 by the dotted circles. These circles decrease as $\epsilon$ goes from 1
to 0, and the eigenvalues $\lambda_i(\epsilon)$ must remain within them. When $\epsilon = 0$, the
eigenvalues are simply

$$\lambda_i(0) = a_{ii}$$

Let S be a connected union of m of the circles, with S disjoint from the
remaining n - m circles. Each path $\{\lambda_i(\epsilon) : 0 \le \epsilon \le 1\}$ that begins at a
center $a_{ii}$ within S must remain within S, because the path is continuous and
S is disjoint from the other circles. Since there are m such paths, the number of
eigenvalues $\lambda_i(1)$ in S will remain at m. This proves the second result.
Since

$$\det\,[A - \lambda I] = \det\,[A - \lambda I]^T = \det\,[A^T - \lambda I]$$

we have $\sigma(A) = \sigma(A^T)$. Thus apply the theorem to the rows of $A^T$ in
order to prove it for the columns of A. This completes the proof.


This theorem can be used in a number of ways, but we just provide two simple 
numerical examples. 


Example Consider the matrix

$$A = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 0 & -1 \\ 1 & 1 & -4 \end{bmatrix}$$

From the preceding theorem, the eigenvalues must be contained in the circles

$$|\lambda - 4| \le 1, \qquad |\lambda| \le 2, \qquad |\lambda + 4| \le 2 \tag{9.1.6}$$

Since the first circle is disjoint from the remaining ones, there must be a single
root in the circle. Since the coefficients of

$$f(\lambda) = \det\,[A - \lambda I]$$

are real, the complex eigenvalues must occur in conjugate pairs, if they occur at
all. This will easily imply, with (9.1.6), that there is a real eigenvalue in the
interval [3, 5]. The last two circles touch at the single point (-2, 0). Using the
same reasoning as before, the eigenvalues in these two circles must be real. And
by using the construction (9.1.4) of $A(\epsilon)$, $\epsilon < 1$, there is one eigenvalue in
[-6, -2] and one in [-2, 2]. Since it is easily checked that $\lambda = -2$ is not an
eigenvalue, we can conclude that A has one real eigenvalue in each of the
intervals

$$[-6, -2), \qquad (-2, 2], \qquad [3, 5]$$

The true eigenvalues are

$$-3.76010, \qquad -.442931, \qquad 4.20303$$
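
The disc computation in this example is easy to automate. The following Python sketch is illustrative only: the function names are invented here, and the 3 × 3 matrix is an assumption chosen so that its row discs are exactly those of (9.1.6). It forms the centers and radii of (9.1.2)-(9.1.3) and checks the quoted eigenvalues against them.

```python
def gerschgorin_discs(A):
    """Return (center, radius) pairs built from the rows of A, as in (9.1.2)-(9.1.3)."""
    discs = []
    for i, row in enumerate(A):
        radius = sum(abs(a) for j, a in enumerate(row) if j != i)
        discs.append((row[i], radius))
    return discs


def in_some_disc(z, discs):
    """True if z lies in at least one of the closed discs."""
    return any(abs(z - center) <= radius for center, radius in discs)


def char_poly(A, lam):
    """det(A - lam*I) for a 3 x 3 matrix, by cofactor expansion along row 1."""
    (a, b, c), (d, e, f), (g, h, i) = A
    a, e, i = a - lam, e - lam, i - lam
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Assumed test matrix: its row discs are |z-4|<=1, |z|<=2, |z+4|<=2.
A = [[4, 1, 0], [1, 0, -1], [1, 1, -4]]
discs = gerschgorin_discs(A)
quoted_eigenvalues = [-3.76010, -0.442931, 4.20303]

for lam in quoted_eigenvalues:
    print(lam, in_some_disc(lam, discs), char_poly(A, lam))
```

Passing the transpose of A to `gerschgorin_discs` gives the column version of the discs mentioned in the theorem.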


Example Consider

$$A = \begin{bmatrix} 4 & 1 & 0 & \cdots & 0 \\ 1 & 4 & 1 & & \vdots \\ 0 & 1 & 4 & \ddots & 0 \\ \vdots & & \ddots & \ddots & 1 \\ 0 & \cdots & 0 & 1 & 4 \end{bmatrix} \tag{9.1.7}$$

a matrix of order n. Since A is symmetric, all eigenvalues of A are real. The radii
$r_i$ of (9.1.2) are all 1 or 2, and all the centers of the circles are $a_{ii} = 4$. Thus from
the preceding theorem, the eigenvalues must all lie in the interval [2, 6]. Since the
eigenvalues of $A^{-1}$ are the reciprocals of those of A, we must have

$$\frac{1}{6} \le \mu \le \frac{1}{2}$$

for all eigenvalues $\mu$ of $A^{-1}$. Using the matrix norm (7.3.22) induced by the
Euclidean vector norm, we have

$$\|A^{-1}\|_2 = r_\sigma(A^{-1}) \le \frac{1}{2}$$

independent of the size of n.
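
The bound can be checked numerically without an eigenvalue solver. The sketch below relies on the standard closed form for a tridiagonal Toeplitz matrix, namely eigenvalues $4 + 2\cos(k\pi/(n+1))$ with eigenvector components $\sin(jk\pi/(n+1))$; this closed form is not stated in the text above and is brought in here as an assumption, which the code verifies by computing residuals. The function names are illustrative.

```python
import math


def tridiag_matvec(x, diag=4.0, off=1.0):
    """Multiply the matrix of (9.1.7) by x without storing the matrix."""
    n = len(x)
    y = []
    for i in range(n):
        s = diag * x[i]
        if i > 0:
            s += off * x[i - 1]
        if i < n - 1:
            s += off * x[i + 1]
        y.append(s)
    return y


def closed_form_pair(n, k):
    """Candidate eigenpair number k (1 <= k <= n) from the closed form."""
    theta = k * math.pi / (n + 1)
    lam = 4.0 + 2.0 * math.cos(theta)
    x = [math.sin(j * theta) for j in range(1, n + 1)]
    return lam, x


def check(n):
    """Return all candidate eigenvalues and the largest residual |Ax - lam x|."""
    eigenvalues, worst = [], 0.0
    for k in range(1, n + 1):
        lam, x = closed_form_pair(n, k)
        y = tridiag_matvec(x)
        worst = max(worst, max(abs(yi - lam * xi) for yi, xi in zip(y, x)))
        eigenvalues.append(lam)
    return eigenvalues, worst


for n in (5, 50, 500):
    eigenvalues, worst = check(n)
    # For symmetric positive definite A, ||A^{-1}||_2 = 1 / (smallest eigenvalue).
    print(n, min(eigenvalues), max(eigenvalues), 1.0 / min(eigenvalues), worst)
```

As n grows, the smallest eigenvalue approaches 2 from above, so the bound of 1/2 is nearly attained for large n.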


Bounds for perturbed eigenvalues Given a matrix A, we wish to perturb it and 
to then observe the effect on the eigenvalues of A. Analytical bounds are derived 
for the perturbations in the eigenvalues based on the perturbations in the matrix 
A. These bounds also suggest a definition of condition number that can be used to 
indicate the degree of stability or instability present in the eigenvalues.

To simplify the arguments considerably we assume that the Jordan canonical
form of A is diagonal (see Theorem 7.6):

$$P^{-1}AP = \mathrm{diag}\,[\lambda_1, \ldots, \lambda_n] = D \tag{9.1.8}$$

for some nonsingular matrix P. The columns of P will be the eigenvectors of A,
corresponding to the eigenvalues $\lambda_1, \ldots, \lambda_n$. Matrices for which (9.1.8) holds are
the most important case in practice. For a brief discussion of the case in which
the Jordan canonical form is not diagonal, see the last topic of this section.

We also need to assume a special property for the matrix norms to be used.
For any diagonal matrix

$$G = \mathrm{diag}\,[g_1, \ldots, g_n]$$

we must have that

$$\|G\| = \max_{1 \le i \le n} |g_i| \tag{9.1.9}$$

All of the operator matrix norms induced by the vector norms $\|x\|_p$, $1 \le p \le \infty$,
have this property. We can now state the following result.


Theorem 9.2 (Bauer-Fike) Let A be a matrix with a diagonal Jordan canonical
form, as in (9.1.8). And assume the matrix norm satisfies (9.1.9). Let
A + E be a perturbation of A, and let $\lambda$ be an eigenvalue of
A + E. Then

$$\min_{1 \le i \le n} |\lambda - \lambda_i| \le \|P\|\,\|P^{-1}\|\,\|E\| \tag{9.1.10}$$


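Proof If $\lambda$ is also an eigenvalue of A, then (9.1.10) is trivially true. Thus
assume $\lambda \ne \lambda_1, \lambda_2, \ldots, \lambda_n$, and let x be an eigenvector for A + E
corresponding to $\lambda$. Then

$$(A + E)x = \lambda x$$
$$(\lambda I - A)x = Ex$$

Substitute from (9.1.8) and multiply by $P^{-1}$ to obtain

$$(\lambda I - PDP^{-1})x = Ex$$
$$(\lambda I - D)(P^{-1}x) = (P^{-1}EP)(P^{-1}x)$$

Since $\lambda \ne \lambda_1, \ldots, \lambda_n$, $\lambda I - D$ is nonsingular,

$$(\lambda I - D)^{-1} = \mathrm{diag}\,[(\lambda - \lambda_1)^{-1}, \ldots, (\lambda - \lambda_n)^{-1}]$$

Then

$$P^{-1}x = (\lambda I - D)^{-1}(P^{-1}EP)(P^{-1}x)$$

$$\|P^{-1}x\| \le \|(\lambda I - D)^{-1}\|\,\|P^{-1}EP\|\,\|P^{-1}x\|$$

Canceling $\|P^{-1}x\|$ and using (9.1.9),

$$1 \le \left[ \max_{1 \le i \le n} |\lambda - \lambda_i|^{-1} \right] \|P^{-1}\|\,\|P\|\,\|E\|$$

This is equivalent to (9.1.10), completing the proof.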

Corollary If A is Hermitian, and if A + E is any perturbation of A, then

$$\min_{1 \le i \le n} |\lambda - \lambda_i| \le \|E\|_2 \tag{9.1.11}$$

for any eigenvalue $\lambda$ of A + E.


Proof Since A is Hermitian, the matrix P can be chosen to be unitary. And
using the operator norm (7.3.19) induced by the Euclidean vector norm,
$\|P\|_2 = \|P^{-1}\|_2 = 1$ (see Problem 13 of Chapter 7). This completes the
proof.


The statement (9.1.11) proves that small perturbations of a Hermitian matrix 
lead to equally small perturbations in the eigenvalues, as was asserted in the 
introduction to this chapter. Note that the relative error in some or all of the 
eigenvalues may still be large, and that this occurs commonly when the eigenval- 
ues of a matrix vary greatly in magnitude. 


Example Consider the Hilbert matrix of order three,

$$H_3 = \begin{bmatrix} 1 & \frac{1}{2} & \frac{1}{3} \\ \frac{1}{2} & \frac{1}{3} & \frac{1}{4} \\ \frac{1}{3} & \frac{1}{4} & \frac{1}{5} \end{bmatrix} \tag{9.1.12}$$

Its eigenvalues to seven significant digits are

$$\lambda_1 = 1.408319, \qquad \lambda_2 = .1223271, \qquad \lambda_3 = .002687340 \tag{9.1.13}$$

Now consider the perturbed matrix $\hat H_3$, representing $H_3$ to four significant digits:

$$\hat H_3 = \begin{bmatrix} 1.000 & .5000 & .3333 \\ .5000 & .3333 & .2500 \\ .3333 & .2500 & .2000 \end{bmatrix} \tag{9.1.14}$$

Its eigenvalues to seven significant digits are

$$\hat\lambda_1 = 1.408294, \qquad \hat\lambda_2 = .1223415, \qquad \hat\lambda_3 = .002664489 \tag{9.1.15}$$

To verify the validity of (9.1.11) for this case, it is straightforward to calculate

$$\|E\|_2 = r_\sigma(E) = \tfrac{1}{3} \times 10^{-4} \doteq .000033$$

For the errors and relative errors in (9.1.15),

$$\lambda_1 - \hat\lambda_1 = .0000249, \qquad \mathrm{Rel}\,(\hat\lambda_1) = .0000177$$

$$\lambda_2 - \hat\lambda_2 = -.0000144, \qquad \mathrm{Rel}\,(\hat\lambda_2) = -.000118$$

$$\lambda_3 - \hat\lambda_3 = .0000229, \qquad \mathrm{Rel}\,(\hat\lambda_3) = .0085$$

All of the errors satisfy (9.1.11). But the relative error in $\hat\lambda_3$ is quite significant
compared to the relative perturbations in $H_3$.
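
The arithmetic of this example can be reproduced with exact rational numbers. The Python sketch below is only an illustration under stated assumptions: the perturbed matrix is entered as the four-digit decimals of the Hilbert matrix, the eigenvalues are copied from (9.1.13) and (9.1.15) rather than computed, and the variable names are invented here. Since E is symmetric and $E^2$ turns out to be a multiple of the identity, the square root of that multiple is $\|E\|_2$.

```python
from fractions import Fraction

# Hilbert matrix of order three, held exactly.
H = [[Fraction(1, i + j + 1) for j in range(3)] for i in range(3)]

# Its entries rounded to four significant digits, also held exactly.
H_hat = [[Fraction(s) for s in row] for row in (
    ("1.000", "0.5000", "0.3333"),
    ("0.5000", "0.3333", "0.2500"),
    ("0.3333", "0.2500", "0.2000"),
)]

E = [[H_hat[i][j] - H[i][j] for j in range(3)] for i in range(3)]
E_squared = [[sum(E[i][k] * E[k][j] for k in range(3)) for j in range(3)]
             for i in range(3)]

# E is symmetric with E^2 = c^2 I, so every eigenvalue of E is +c or -c.
norm_E = Fraction(1, 30000)

lam = [1.408319, 0.1223271, 0.002687340]        # copied from (9.1.13)
lam_hat = [1.408294, 0.1223415, 0.002664489]    # copied from (9.1.15)
errors = [a - b for a, b in zip(lam, lam_hat)]
relative_errors = [e / a for e, a in zip(errors, lam)]

for e, r in zip(errors, relative_errors):
    print(f"error {e: .7f}   relative error {r: .6f}   bound {float(norm_E):.7f}")
```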


For a nonsymmetric matrix A with P as in (9.1.8), the number

$$K(A) = \|P\|\,\|P^{-1}\|$$

will be called the condition number for the eigenvalue problem for A. This is
based on the bound (9.1.10) for the perturbations in the eigenvalues of the matrix
A when it is perturbed. Another choice, even more difficult to compute, would be
to use

$$K(A) = \mathrm{Infimum}\; \|P\|\,\|P^{-1}\| \tag{9.1.16}$$

with the infimum taken over all matrices P for which (9.1.8) holds and over all
matrix norms satisfying (9.1.9). The reason for having condition numbers is that
for nonsymmetric matrices A, small perturbations E can lead to relatively large
perturbations in the eigenvalues of A.


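Example To illustrate the pathological problems that can occur with nonsymmetric
matrices, consider

$$A = \begin{bmatrix} 101 & -90 \\ 110 & -98 \end{bmatrix}, \qquad A + E = \begin{bmatrix} 101 - \epsilon & -90 - \epsilon \\ 110 & -98 \end{bmatrix} \tag{9.1.17}$$

The eigenvalues of A are $\lambda = 1, 2$, and the eigenvalues of A + E are

$$\lambda = \frac{3 - \epsilon \pm \sqrt{1 - 838\epsilon + \epsilon^2}}{2}$$

As a specific example to give better intuition, take $\epsilon = .001$. Then

$$A + E = \begin{bmatrix} 100.999 & -90.001 \\ 110 & -98 \end{bmatrix}$$

and its eigenvalues are

$$\lambda \doteq 1.298, \; 1.701 \tag{9.1.18}$$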


This example should not be taken to imply that all nonsymmetric matrices are
ill-conditioned. Most cases in practice are fairly well-conditioned. But in writing
a general algorithm, we always seek to cover as many cases as possible, and this
example shows that this is likely to be difficult for the class of all nonsymmetric
matrices.
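
For a 2 × 2 matrix the eigenvalues follow from the trace and determinant, so the sensitivity in this example can be checked directly. The sketch below is illustrative; the function names are invented here, and the perturbation is taken to subtract $\epsilon$ from both entries of the first row, which is an assumption consistent with the perturbed matrix displayed for $\epsilon = .001$. With that assumption the trace is $3 - \epsilon$, the determinant is $2 + 208\epsilon$, and the discriminant is $1 - 838\epsilon + \epsilon^2$.

```python
import math


def eigs_2x2(M):
    """Real eigenvalues of a 2 x 2 matrix from its trace and determinant."""
    (a, b), (c, d) = M
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4.0 * det
    if disc < 0:
        raise ValueError("complex eigenvalues")
    root = math.sqrt(disc)
    return (trace - root) / 2.0, (trace + root) / 2.0


def perturbed(eps):
    """A + E with both first-row entries reduced by eps (assumed form of E)."""
    return [[101.0 - eps, -90.0 - eps], [110.0, -98.0]]


def discriminant(eps):
    """Closed form of trace^2 - 4 det for the perturbed matrix."""
    return 1.0 - 838.0 * eps + eps * eps


for eps in (0.0, 1e-5, 1e-4, 1e-3):
    low, high = eigs_2x2(perturbed(eps))
    print(f"eps = {eps:<8g} eigenvalues = {low:.6f}, {high:.6f}")
```

A perturbation of size .001 in two entries of magnitude about 100 moves the eigenvalue 1 by roughly .3, which is the behavior the example describes.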


For symmetric matrices, the result (9.1.11) can be improved upon in several
ways. There is a minimax characterization for the eigenvalues of symmetric 
matrices. For a discussion of this theory and the resultant error bounds, see 
Parlett (1980, sec. 10.2) or Wilkinson (1965, p. 101). Instead, we give the 
following result, which will be more useful for error analyses of methods 
presented later. 


Theorem 9.3 (Wielandt-Hoffman) Let A and E be real, symmetric matrices of
order n, and define $\hat A = A + E$. Let $\lambda_i$ and $\hat\lambda_i$, $i = 1, \ldots, n$, be the
eigenvalues of A and $\hat A$, respectively, arranged in increasing order.
Then

$$\left[ \sum_{j=1}^{n} (\lambda_j - \hat\lambda_j)^2 \right]^{1/2} \le F(E) \tag{9.1.19}$$

where F(E) is the Frobenius norm of E, defined in (7.3.10).


Proof See Wilkinson (1965, pp. 104-108).


This result will be used later in bounding the effect of the rounding errors that 
occur in reducing a symmetric matrix to tridiagonal form. 
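
A small numerical check of the Wielandt-Hoffman inequality can be made with 2 × 2 symmetric matrices, whose eigenvalues have a closed form. The matrices A and E below are arbitrary illustrative choices, not taken from the text, and the helper names are invented here.

```python
import math


def sym2_eigs(M):
    """Eigenvalues of a real symmetric 2 x 2 matrix, in increasing order."""
    (a, b), (_, d) = M
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return [mean - radius, mean + radius]


def frobenius(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in M for v in row))


# Hypothetical symmetric matrix and symmetric perturbation.
A = [[2.0, 1.0], [1.0, 3.0]]
E = [[0.01, -0.02], [-0.02, 0.03]]
A_hat = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]

lam = sym2_eigs(A)
lam_hat = sym2_eigs(A_hat)
lhs = math.sqrt(sum((x - y) ** 2 for x, y in zip(lam, lam_hat)))
print(lhs, "<=", frobenius(E))
```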


A computable error bound for symmetric matrices Let A be a symmetric matrix
for which an approximate eigenvalue $\lambda$ and approximate eigenvector x have been
computed. Define the residual

$$\eta = Ax - \lambda x \tag{9.1.20}$$

Since A is symmetric, there is a unitary matrix U for which

$$U^*AU = \mathrm{diag}\,[\lambda_1, \ldots, \lambda_n] = D \tag{9.1.21}$$

Then we will show that

$$\min_{1 \le i \le n} |\lambda - \lambda_i| \le \frac{\|\eta\|_2}{\|x\|_2} \tag{9.1.22}$$

Using (9.1.21),

$$\eta = UDU^*x - \lambda x$$
$$U^*\eta = DU^*x - \lambda U^*x = (D - \lambda I)U^*x$$

If $\lambda$ is an eigenvalue of A, then (9.1.22) is trivially true. Thus there is no loss of
generality in assuming $\lambda \ne \lambda_1, \ldots, \lambda_n$. Thus $D - \lambda I$ is nonsingular, and

$$U^*x = (D - \lambda I)^{-1}U^*\eta$$
$$\|U^*x\|_2 \le \|(D - \lambda I)^{-1}\|_2\,\|U^*\eta\|_2$$

Recall Problem 13 of Chapter 7, which implies

$$\|U^*x\|_2 = \|x\|_2, \qquad \|U^*\eta\|_2 = \|\eta\|_2$$

Then using the definition of the matrix norm, we have

$$\|x\|_2 \le \left[ \max_{1 \le i \le n} |(\lambda - \lambda_i)^{-1}| \right] \|\eta\|_2$$

which is equivalent to (9.1.22).
The use of (9.1.22) is illustrated later in (9.2.15) of Section 9.2, using an 
approximate eigenvalue—eigenvector pair produced by the power method. 
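
The residual bound (9.1.22) needs only a matrix-vector product and two vector norms. The following sketch shows the computation for a small symmetric matrix; the matrix, the rough eigenpair guess, and the function names are all hypothetical choices for illustration, and the exact eigenvalues are computed only to confirm the bound.

```python
import math


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def norm2(v):
    return math.sqrt(sum(vi * vi for vi in v))


def residual_bound(A, lam, x):
    """Return ||Ax - lam x||_2 / ||x||_2, the right side of (9.1.22)."""
    eta = [axi - lam * xi for axi, xi in zip(matvec(A, x), x)]
    return norm2(eta) / norm2(x)


# Hypothetical symmetric test matrix and a rough approximate eigenpair.
A = [[2.0, 1.0], [1.0, 3.0]]
x = [0.5, 1.0]
lam = 3.6

bound = residual_bound(A, lam, x)

# Exact eigenvalues of this 2 x 2 matrix: 2.5 -/+ sqrt(1.25).
exact = [2.5 - math.sqrt(1.25), 2.5 + math.sqrt(1.25)]
distance = min(abs(lam - mu) for mu in exact)
print(distance, "<=", bound)
```

The bound is often pessimistic, as it is here, but it requires no knowledge of the true eigenvalues.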


Stability of eigenvalues for nonsymmetric matrices In order to deal effectively
with the potential for instability in the nonsymmetric matrix eigenvalue problem,
it is necessary to have a better understanding of the nature of that instability. For
example, one consequence of the analysis of instability will be that unitary
similarity transformations will not make worse the conditioning of the problem.
As before, assume that A has a diagonal Jordan canonical form:

$$P^{-1}AP = \mathrm{diag}\,[\lambda_1, \ldots, \lambda_n] = D \tag{9.1.23}$$

Then $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of A, and the columns of P are the
corresponding eigenvectors, call them $u_1, \ldots, u_n$. The matrix P is not unique.
For example, if F is any nonsingular diagonal matrix, then

$$(PF)^{-1}A(PF) = F^{-1}DF = D$$

By choosing F appropriately, the columns of PF will have length one. Thus
without loss of generality, assume the columns of P have length one:

$$Au_i = \lambda_i u_i, \qquad u_i^* u_i = 1, \qquad i = 1, \ldots, n \tag{9.1.24}$$

with

$$P = [u_1, \ldots, u_n]$$

By taking the conjugate transpose in (9.1.23),

$$P^*A^*(P^*)^{-1} = D^* = \mathrm{diag}\,[\bar\lambda_1, \ldots, \bar\lambda_n]$$

which shows that the eigenvalues of A* are the complex conjugates of those of A.
Writing

$$(P^*)^{-1} = [w_1, \ldots, w_n] \tag{9.1.25}$$

we have

$$A^*w_i = \bar\lambda_i w_i, \qquad i = 1, \ldots, n \tag{9.1.26}$$

Equivalently, by forming the conjugate transpose,

$$w_i^*A = \lambda_i w_i^* \tag{9.1.27}$$

This says $w_i^*$ is a left eigenvector of A, for the eigenvalue $\lambda_i$. Since $P^{-1}P = I$,
and since

$$P^{-1} = \begin{bmatrix} w_1^* \\ \vdots \\ w_n^* \end{bmatrix}$$

we have

$$w_i^* u_j = \begin{cases} 1, & i = j \\ 0, & i \ne j \end{cases} \tag{9.1.28}$$

This says the eigenvectors $\{u_i\}$ of A and the eigenvectors $\{w_i\}$ of A* form a
biorthogonal set.
Normalize the eigenvectors $w_i$ by

$$v_i = \frac{w_i}{\|w_i\|_2}, \qquad i = 1, \ldots, n$$

Define

$$s_i = v_i^* u_i = \frac{1}{\|w_i\|_2} \tag{9.1.29}$$

a positive real number. The matrix $(P^*)^{-1}$ can now be written

$$(P^*)^{-1} = \left[ \frac{v_1}{s_1}, \ldots, \frac{v_n}{s_n} \right]$$

And

$$A^*v_i = \bar\lambda_i v_i, \qquad \|v_i\|_2 = 1, \qquad i = 1, \ldots, n \tag{9.1.30}$$


We now examine the stability of a simple eigenvalue $\lambda_k$ of A. Being simple
means that $\lambda_k$ has multiplicity one as a root of the characteristic polynomial of
A. The results can be extended to eigenvalues of multiplicity greater than one,
but we omit that case. Consider the perturbed matrix

$$A(\epsilon) = A + \epsilon B, \qquad \epsilon > 0$$

for some matrix B, independent of $\epsilon$. Denote the eigenvalues of $A(\epsilon)$ by
$\lambda_1(\epsilon), \ldots, \lambda_n(\epsilon)$. Then

$$P^{-1}A(\epsilon)P = D + \epsilon C, \qquad C = P^{-1}BP$$

$$c_{ij} = \frac{1}{s_i}\, v_i^* B u_j, \qquad 1 \le i, j \le n \tag{9.1.31}$$

We will prove that

$$\lambda_k(\epsilon) = \lambda_k + \frac{\epsilon}{s_k}\, v_k^* B u_k + O(\epsilon^2) \tag{9.1.32}$$
: k 


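The derivation of this result uses the Gerschgorin Theorem 9.1. We also need
to note that for any nonsingular diagonal matrix F,

$$FP^{-1}A(\epsilon)PF^{-1} = D + \epsilon FCF^{-1} \tag{9.1.33}$$

and this leaves the eigenvalues of $A(\epsilon)$ unchanged. Pick F as follows:

$$f_{ii} = \begin{cases} \epsilon\alpha, & i = k \\ 1, & i \ne k \end{cases}$$

with $\alpha$ a positive constant to be determined later. Most of the coefficients of the
matrix (9.1.33) are not changed, and only those in row k and column k need to
be considered. They are

$$[D + \epsilon FCF^{-1}]_{kj} = \begin{cases} \epsilon^2 \alpha\, c_{kj}, & j \ne k \\ \lambda_k + \epsilon c_{kk}, & j = k \end{cases}$$

$$[D + \epsilon FCF^{-1}]_{ik} = \frac{1}{\alpha}\, c_{ik}, \qquad i \ne k$$

Apply Theorem 9.1 to the matrix (9.1.33). The circle centers and radii are

$$\text{center} = \lambda_k + \epsilon c_{kk}, \qquad r_k = \epsilon^2 \alpha \sum_{j \ne k} |c_{kj}|$$

$$\text{center} = \lambda_i + \epsilon c_{ii}, \qquad r_i = \epsilon \sum_{j \ne i, k} |c_{ij}| + \frac{1}{\alpha} |c_{ik}|, \qquad i \ne k \tag{9.1.34}$$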


We wish to pick $\alpha$ so large and $\epsilon$ sufficiently small so as to isolate the circle about
$\lambda_k + \epsilon c_{kk}$ from the remaining circles, and in that way, know there is exactly one
eigenvalue of (9.1.33) within circle k. The distance between the centers of circles
k and $i \ne k$ is bounded from below by

$$|\lambda_i - \lambda_k| - \epsilon\,|c_{ii} - c_{kk}|$$

which is about $|\lambda_i - \lambda_k|$ for small values of $\epsilon$. Pick $\alpha$ such that

$$\frac{1}{\alpha} |c_{ik}| \le \frac{1}{2} |\lambda_i - \lambda_k|, \qquad \text{all } i \ne k \tag{9.1.35}$$

Then choose $\epsilon_0$ such that for all $0 < \epsilon \le \epsilon_0$, circle k does not intersect any of the
remaining circles. This can be done because $\lambda_k$ is distinct from the remaining
eigenvalues $\lambda_i$ and because of the inequality (9.1.35).

Using this construction, Theorem 9.1 implies that circle k contains exactly one
eigenvalue of $A(\epsilon)$, call it $\lambda_k(\epsilon)$. From (9.1.34),

$$|\lambda_k(\epsilon) - \lambda_k - \epsilon c_{kk}| \le r_k = O(\epsilon^2)$$

and using the formula for $c_{kk}$ in (9.1.31), this proves the desired result (9.1.32).
Taking bounds in (9.1.32), we obtain

$$|\lambda_k(\epsilon) - \lambda_k| \le \frac{\epsilon}{s_k} \|v_k\|_2 \|B\|_2 \|u_k\|_2 + O(\epsilon^2)$$

and using (9.1.24) and (9.1.30),

$$|\lambda_k(\epsilon) - \lambda_k| \le \frac{\epsilon}{s_k} \|B\|_2 + O(\epsilon^2) \tag{9.1.36}$$

The number $s_k$ is intimately related to the stability of the eigenvalue $\lambda_k$
when the matrix A is perturbed by small amounts $E = \epsilon B$. If A were symmetric,
we would have $u_k = v_k$, and thus $s_k = 1$, giving the same qualitative result for
symmetric matrices as derived previously. For nonsymmetric matrices, if $s_k$ is
quite small, then small perturbations $E = \epsilon B$ can lead to a large perturbation in
the eigenvalue $\lambda_k$. Such problems are called ill-conditioned.


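Example Recall the example (9.1.17). With $\epsilon = .001$,

$$A(\epsilon) = A + \epsilon B, \qquad P^{-1}AP = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}, \qquad B = \begin{bmatrix} -1 & -1 \\ 0 & 0 \end{bmatrix}$$

$$P = \begin{bmatrix} 9/\sqrt{181} & 10/\sqrt{221} \\ 10/\sqrt{181} & 11/\sqrt{221} \end{bmatrix}, \qquad P^{-1} = \begin{bmatrix} -11\sqrt{181} & 10\sqrt{181} \\ 10\sqrt{221} & -9\sqrt{221} \end{bmatrix} \tag{9.1.37}$$

If we use the row norm to estimate the condition number of (9.1.16), then

$$K(A) \le \|P\|_\infty \|P^{-1}\|_\infty \doteq 419$$

The columns of P give $u_1$, $u_2$, and the columns of $(P^{-1})^T$ give the vectors $w_1$, $w_2$
[see (9.1.25)].
To calculate (9.1.36) for $\lambda_1 = 1$,

$$s_1 = v_1^* u_1 = \frac{1}{\|w_1\|_2} = \frac{1}{\sqrt{(221)(181)}} \doteq .005$$

and $\|B\|_2 = \sqrt{2}$. Formula (9.1.36) yields

$$|\lambda_1(\epsilon) - \lambda_1| \le \frac{\sqrt{2}\,\epsilon}{.005} + O(\epsilon^2) = 283\epsilon + O(\epsilon^2) \tag{9.1.38}$$

The actual error is $\lambda_1(.001) - \lambda_1 = 1.298 - 1 = .298$, and the preceding gives the
error estimate of $283\epsilon = .283$. Thus (9.1.36) is a reasonable estimated bound of
the error.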
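
The quantities in this example can be recomputed from the eigenvectors alone. The sketch below is illustrative and rests on assumptions that should be checked against (9.1.17) and (9.1.37): A is taken as [[101, -90], [110, -98]], B as [[-1, -1], [0, 0]], and the unit eigenvectors as multiples of (9, 10) and (10, 11). All variable names are invented here. It also evaluates the first-order coefficient $v_1^* B u_1 / s_1$ of (9.1.32), which the text does not quote.

```python
import math

# Unit-length right eigenvectors for lambda = 1 and lambda = 2 (assumed data).
u1 = [9 / math.sqrt(181), 10 / math.sqrt(181)]
u2 = [10 / math.sqrt(221), 11 / math.sqrt(221)]
P = [[u1[0], u2[0]], [u1[1], u2[1]]]          # columns are u1, u2

# Inverse of a 2 x 2 matrix by the adjugate formula.
det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
P_inv = [[P[1][1] / det, -P[0][1] / det],
         [-P[1][0] / det, P[0][0] / det]]

# In the real case, w_1 is the first row of P^{-1}.
w1 = P_inv[0]
s1 = 1.0 / math.hypot(w1[0], w1[1])

B = [[-1.0, -1.0], [0.0, 0.0]]
norm_B = math.sqrt(2.0)                        # B B^T = diag(2, 0)
bound_coefficient = norm_B / s1                # multiplies eps in (9.1.36)

# First-order coefficient (1/s_1) v_1^* B u_1 = w_1^T B u_1 from (9.1.32).
Bu1 = [sum(b * x for b, x in zip(row, u1)) for row in B]
first_order = w1[0] * Bu1[0] + w1[1] * Bu1[1]


def row_norm(M):
    return max(sum(abs(v) for v in row) for row in M)


K_estimate = row_norm(P) * row_norm(P_inv)
print(s1, bound_coefficient, first_order, K_estimate)
```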


In the following sections, some of the numerical methods first convert the 
matrix A to a simpler form using similarity transformations. We wish to use 
transformations that will not make the numbers s, even smaller, which would 
make an ill-conditioned problem even worse. From this viewpoint, unitary or
orthogonal transformations are the best ones to use.

Let U be unitary, and let $\hat A = U^*AU$. For a simple eigenvalue $\lambda_k$, let $s_k$ and
$\hat s_k$ denote the numbers (9.1.29) for the two matrices A and $\hat A$. If $\{u_i\}$ and $\{v_i\}$
are the normalized eigenvectors of A and A*, then $\{U^*u_i\}$ and $\{U^*v_i\}$ are the
corresponding eigenvectors for $\hat A$ and $\hat A^*$. For $\hat s_k$,

$$\hat s_k = (U^*v_k)^*(U^*u_k) = v_k^* U U^* u_k = v_k^* u_k = s_k \tag{9.1.39}$$

Thus the stability of the eigenvalue $\lambda_k$ is made neither better nor worse. Unitary
transformations also preserve vector length and the angles between vectors (see 
Problem 13 of Chapter 7). In general, unitary matrix operations on a given 
matrix A will not cause any deterioration in the conditioning of the eigenvalue 
problem, and that is one of the major reasons that they are the preferred form of 
similarity transformation in solving the matrix eigenvalue problem. 

Techniques similar to the preceding, in (9.1.23)-(9.1.36), can be used to give a
stability result for eigenvectors of isolated eigenvalues. Using the same assumptions
on $\{\lambda_i\}$, $\{u_i = u_i(0)\}$, and $\{v_i\}$, consider the eigenvector problem

$$(A + \epsilon B)u_k(\epsilon) = \lambda_k(\epsilon)u_k(\epsilon), \qquad \|u_k(\epsilon)\|_2 = 1 \tag{9.1.40}$$

with $\lambda_k$ a simple eigenvalue of A. Then


us o* Bu 
u,(€) =u, + ea,u, ted feeene + O(e*) (9.1.41) 
jal Le cae | 
jak 


The proof of this is left to Problem 6.

The result (9.1.41) shows that stability of $u_k(\epsilon)$ depends on the condition
numbers $s_j$ and on the nearness of the eigenvalues $\lambda_j$ to $\lambda_k$. This indicates
probable instability, and further examples of this are given in Problem 7. A
deeper examination of the behavior of eigenvectors when A is subjected to
perturbations requires an examination of the eigenvector subspaces and their
relation to each other, especially when the eigenvalue $\lambda_k$ is not simple. For more
on this, see Golub and Van Loan (1983, pp. 203-207, 271-275).


Matrices with nondiagonal Jordan canonical form We have avoided discussing 
the eigenvalue problem for those matrices for which the Jordan canonical form is 
not diagonal. There are problems of instability in the eigenvalue problem, worse 
than that given in Theorem 9.2. And there are significant problems in the 
determination of a correct basis for the eigenvectors. 

Rather than giving a general development, we examine the difficulties for this 
class of matrices by examining one simple case in detail. Let 


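$$A = \begin{bmatrix} 1 & 1 & 0 & \cdots & 0 \\ 0 & 1 & 1 & & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ & & & 1 & 1 \\ 0 & \cdots & & 0 & 1 \end{bmatrix} \tag{9.1.42}$$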

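a matrix of order n. The characteristic polynomial is

$$f(\lambda) = (1 - \lambda)^n$$

and $\lambda = 1$ is a root of multiplicity n. There is only a one-dimensional set of
eigenvectors, spanned by

$$x = [1, 0, \ldots, 0]^T \tag{9.1.43}$$

For $\epsilon > 0$, perturb A to

$$A(\epsilon) = \begin{bmatrix} 1 & 1 & 0 & \cdots & 0 \\ 0 & 1 & 1 & & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ 0 & & & 1 & 1 \\ \epsilon & 0 & \cdots & 0 & 1 \end{bmatrix}$$

Its characteristic polynomial is

$$f_\epsilon(\lambda) = (1 - \lambda)^n - (-1)^n \epsilon$$

There are n distinct roots,

$$\lambda_k(\epsilon) = 1 + \omega_k\, \epsilon^{1/n}, \qquad k = 1, \ldots, n \tag{9.1.44}$$

with $\{\omega_k\}$ the nth roots of unity,

$$\omega_k = e^{2\pi k i/n}, \qquad k = 1, \ldots, n$$

For the perturbations in the eigenvalue of A,

$$|\lambda_k(\epsilon) - \lambda_k| = \epsilon^{1/n} \tag{9.1.45}$$

For example, if n = 10 and $\epsilon = 10^{-10}$, then

$$|\lambda_k(\epsilon) - \lambda_k| = .1 \tag{9.1.46}$$

The earlier result (9.1.10) gave a bound that was linear in $\epsilon$, and (9.1.45) is much
worse, as shown by (9.1.46).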
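
The size of this effect is easy to confirm numerically. The sketch below is an illustration with invented function names. It assumes the perturbed matrix has ones on the diagonal and superdiagonal and $\epsilon$ in the lower left corner, so that an eigenvalue $1 + \mu$ with $\mu^n = \epsilon$ has eigenvector components $\mu^{i-1}$; the code checks this by forming the residual $A(\epsilon)x - \lambda x$ directly.

```python
import cmath
import math


def perturbed_eigenvalues(n, eps):
    """The n values 1 + omega_k * eps**(1/n), omega_k the nth roots of unity."""
    r = eps ** (1.0 / n)
    return [1 + r * cmath.exp(2j * math.pi * k / n) for k in range(1, n + 1)]


def eigen_residual(n, eps, lam):
    """Largest component of A(eps) x - lam x, where x_i = (lam - 1)**(i - 1)."""
    mu = lam - 1
    x = [mu ** i for i in range(n)]
    worst = 0.0
    for i in range(n):
        if i < n - 1:
            ax_i = x[i] + x[i + 1]          # rows 1..n-1: diagonal and superdiagonal
        else:
            ax_i = eps * x[0] + x[i]        # last row: eps in column 1, 1 on diagonal
        worst = max(worst, abs(ax_i - lam * x[i]))
    return worst


n, eps = 10, 1e-10
eigenvalues = perturbed_eigenvalues(n, eps)
for lam in eigenvalues:
    print(f"{lam:.6f}   |lam - 1| = {abs(lam - 1):.6f}   residual = {eigen_residual(n, eps, lam):.1e}")
```

A perturbation of size $10^{-10}$ in a single entry moves every eigenvalue a distance of about 0.1.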

Since A(e) has n distinct eigenvalues, it also has a complete set of n linearly 
independent eigenvectors, call them $x_1(\epsilon), \ldots, x_n(\epsilon)$. The first thing that must be
done is to give the relationship of these eigenvectors to the single eigenvector x in 
(9.1.43). This is a difficult problem to deal with, and it always must be dealt with 
for matrices whose Jordan form is not diagonal. 

The matrices A and A(e) are in extremely simple form, and they merely hint 
at the difficulties that can occur when a matrix is not similar to a diagonal matrix. 
In actual practice, rounding errors will always ensure that such a matrix will have 
distinct eigenvalues. And this example is correct from the qualitative point of 
view in showing the difficulties that will arise. 


9.2 The Power Method 


This is a classical method, of use mainly in finding the dominant eigenvalue and 
associated eigenvector of a matrix. It is not a general method, but is useful in a 
number of situations. For example, it is sometimes a satisfactory method with 
large sparse matrices, where the methods of later sections cannot be used because 
of computer memory size limitations. In addition the method of inverse iteration, 
described in Section 9.6, is the power method applied to an appropriate inverse 
matrix. And the considerations of this section are an introduction to that later 
material. It is extremely difficult to implement the power method as a
general-purpose computer program, treating a large and quite varied class of matrices.
But it is easy to implement for more special classes.


We assume that A is a real n × n matrix for which the Jordan canonical form
is diagonal. Let $\lambda_1, \ldots, \lambda_n$ denote the eigenvalues of A, and let $x_1, \ldots, x_n$ be the
corresponding eigenvectors, which form a basis for $C^n$. We further assume that

$$|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n| \ge 0 \tag{9.2.1}$$

Although quite special, this is the main case of interest for the application of the
power method. And the development can be extended fairly easily to the case of
a single dominant eigenvalue of geometric multiplicity r > 1 (see Problem 10).
Note that these assumptions imply $\lambda_1$ and $x_1$ are real.

Let $z^{(0)}$ be a real initial guess of some multiple of $x_1$. If there is no rational
method for choosing $z^{(0)}$, then use a random number generator to choose each
component. For the power method, define

$$w^{(m)} = Az^{(m-1)} \tag{9.2.2}$$

Let $\alpha_m$ be a component of $w^{(m)}$ that is maximum in size. Define

$$z^{(m)} = \frac{w^{(m)}}{\alpha_m}, \qquad m \ge 1 \tag{9.2.3}$$

We show that the vectors $\{z^{(m)}\}$ will approximate $\hat\sigma_m x_1 / \|x_1\|_\infty$, with each
$\hat\sigma_m = \pm 1$, as $m \to \infty$.
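We begin by showing that

$$z^{(m)} = \sigma_m \frac{A^m z^{(0)}}{\|A^m z^{(0)}\|_\infty}, \qquad \sigma_m = \pm 1, \qquad m \ge 1 \tag{9.2.4}$$

First, $w^{(1)} = Az^{(0)}$,

$$\alpha_1 = \sigma_1 \|w^{(1)}\|_\infty = \sigma_1 \|Az^{(0)}\|_\infty, \qquad \sigma_1 = \pm 1$$

Then

$$z^{(1)} = \frac{w^{(1)}}{\alpha_1} = \sigma_1 \frac{Az^{(0)}}{\|Az^{(0)}\|_\infty}$$

The proof of (9.2.4) for general m > 1 uses mathematical induction. For the case
m = 2, as an example,

$$w^{(2)} = Az^{(1)} = \sigma_1 \frac{A^2 z^{(0)}}{\|Az^{(0)}\|_\infty}$$

$$\alpha_2 = \rho_2 \|w^{(2)}\|_\infty = \rho_2 \frac{\|A^2 z^{(0)}\|_\infty}{\|Az^{(0)}\|_\infty}, \qquad \rho_2 = \pm 1$$

$$z^{(2)} = \frac{w^{(2)}}{\alpha_2} = \sigma_1 \frac{A^2 z^{(0)}}{\|Az^{(0)}\|_\infty} \cdot \rho_2 \frac{\|Az^{(0)}\|_\infty}{\|A^2 z^{(0)}\|_\infty} = \sigma_2 \frac{A^2 z^{(0)}}{\|A^2 z^{(0)}\|_\infty}$$

with $\sigma_2 = \sigma_1 \rho_2$. The case of general m is essentially the same.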
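To examine the convergence of $\{z^{(m)}\}$, first expand $z^{(0)}$ using the eigenvector
basis $\{x_j\}$:

$$z^{(0)} = \sum_{j=1}^{n} \alpha_j x_j$$

We will assume $\alpha_1 \ne 0$, which a random choice of $z^{(0)}$ will generally ensure. Also
$\alpha_1$ can be shown to be real. Then

$$A^m z^{(0)} = \sum_{j=1}^{n} \alpha_j A^m x_j = \sum_{j=1}^{n} \alpha_j \lambda_j^m x_j = \lambda_1^m \left[ \alpha_1 x_1 + \sum_{j=2}^{n} \alpha_j \left( \frac{\lambda_j}{\lambda_1} \right)^m x_j \right] \tag{9.2.5}$$

From (9.2.1),

$$\left( \frac{\lambda_j}{\lambda_1} \right)^m \to 0 \quad \text{as } m \to \infty, \qquad 2 \le j \le n \tag{9.2.6}$$

Using this in (9.2.5) and (9.2.4), we have

$$z^{(m)} \approx \sigma_m \left( \frac{\lambda_1}{|\lambda_1|} \right)^m \frac{\alpha_1}{|\alpha_1|} \cdot \frac{x_1}{\|x_1\|_\infty} = \hat\sigma_m \frac{x_1}{\|x_1\|_\infty} \tag{9.2.7}$$

Then $\hat\sigma_m = \pm 1$, and generally it is independent of m. This will be the case if $x_1$
has a unique maximal component. In cases with $x_1$ having more than one
maximal component, it is possible that $\hat\sigma_m$ will vary with m (see Problem 9). The
rate of convergence in (9.2.7) depends on $|\lambda_2/\lambda_1|$:

$$\left\| z^{(m)} - \hat\sigma_m \frac{x_1}{\|x_1\|_\infty} \right\|_\infty \le c \left| \frac{\lambda_2}{\lambda_1} \right|^m, \qquad m \ge 0 \tag{9.2.8}$$

since all of the remaining ratios $|\lambda_j/\lambda_1|$ are bounded by $|\lambda_2/\lambda_1|$.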


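To obtain a sequence of approximate eigenvalues, let k be the index of a
nonzero component of $x_1$. Generally, we pick k as the subscript of the
component $\beta_m$, and thus it will possibly vary with m. Define

$$ \lambda_1^{(m)} = \frac{w_k^{(m)}}{z_k^{(m-1)}}, \qquad m \ge 1 \tag{9.2.9} $$

To examine the rate of convergence, use (9.2.4) and (9.2.5). Writing $x_{j,k}$ for
component k of $x_j$, the scalar factors in (9.2.4) cancel and

$$ \lambda_1^{(m)} = \frac{[A z^{(m-1)}]_k}{[z^{(m-1)}]_k} = \frac{[A^m z^{(0)}]_k}{[A^{m-1} z^{(0)}]_k} = \lambda_1 \frac{\alpha_1 x_{1,k} + \sum_{j=2}^{n} \alpha_j (\lambda_j/\lambda_1)^m x_{j,k}}{\alpha_1 x_{1,k} + \sum_{j=2}^{n} \alpha_j (\lambda_j/\lambda_1)^{m-1} x_{j,k}} \tag{9.2.10} $$

$$ \lambda_1^{(m)} = \lambda_1 \left[ 1 + O\!\left( \left( \frac{\lambda_2}{\lambda_1} \right)^m \right) \right] \tag{9.2.11} $$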
The rate of convergence is linear, and it depends on $|\lambda_2/\lambda_1|$. The index k is
usually chosen fixed. If we choose k as the index of the maximal component $\beta_m$
of $w^{(m)}$, and if $x_1$ has a single maximum component, then k will become
constant as $m \to \infty$. With more than one maximal component, it can move
about, as is shown in Problem 9. An alternative method of defining $\lambda_1^{(m)}$ is given
in Conte and de Boor (1980, p. 192), avoiding the need to select a particular
component index:

$$ \lambda_1^{(m)} = \frac{u^T w^{(m)}}{u^T z^{(m-1)}} $$

The vector u is to satisfy $u^T x_1 \ne 0$, and a random choice of u will generally
suffice. The error in $\lambda_1^{(m)}$ will again satisfy (9.2.11).


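Example Let

$$ A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 5 \end{bmatrix} \tag{9.2.12} $$

The true eigenvalues are

$$ \lambda_1 = 9.623475383 \qquad \lambda_2 = -.6234753830 \qquad \lambda_3 = 0 \tag{9.2.13} $$

An initial guess $z^{(0)}$ was generated with a random number generator. The first
five iterates are shown in Table 9.1, along with the ratios

$$ R_m = \frac{\lambda_1^{(m)} - \lambda_1^{(m-1)}}{\lambda_1^{(m-1)} - \lambda_1^{(m-2)}} \tag{9.2.14} $$

The iterates $\lambda_1^{(m)}$ were defined using (9.2.9), with k = 3. According to a later
discussion, these ratios should approximate the ratio $\lambda_2/\lambda_1$ as $m \to \infty$.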

We use the computable error bound (9.1.22) that was derived earlier in Section
9.1. Calculate

$$ \eta = Ax - \lambda x $$

with

$$ x = z^{(5)} \qquad \lambda = \lambda_1^{(5)} $$


Table 9.1 Example of the power method

m    $z_1^{(m)}$   $z_2^{(m)}$   $z_3^{(m)}$   $\lambda_1^{(m)}$     $R_m$
1    .50077    .75038    1.0000    11.7628133
2    .52626    .76313    1.0000     9.5038496
3    .52459    .76230    1.0000     9.6313231    -.05643
4    .52470    .76235    1.0000     9.6229674    -.06555
5    .52469    .76235    1.0000     9.6235083    -.06474
Then from (9.1.22),

$$ \min_i |\lambda_i - \lambda_1^{(5)}| \le \frac{\|\eta\|_2}{\|x\|_2} = 3.30 \times 10^{-5} \tag{9.2.15} $$

A direct comparison with the true answer $\lambda_1$ gives

$$ \lambda_1 - \lambda_1^{(5)} = -.0000329 $$

which shows that (9.2.15) is a very accurate estimate in this case.
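
The iteration (9.2.2), (9.2.3), (9.2.9) can be sketched in a few lines. The following is only an illustrative sketch in plain Python, not the program used to produce Table 9.1: the function name `power_method` is hypothetical, the starting vector of all ones is an assumption (the table used a random start), and k is taken as the index of $\beta_m$ at each step.

```python
def power_method(A, z0, iterations):
    """Power method with max-norm scaling: (9.2.2), (9.2.3), (9.2.9)."""
    n = len(A)
    z = list(z0)
    estimates = []
    for _ in range(iterations):
        # w^(m) = A z^(m-1)
        w = [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]
        # beta_m is a component of w^(m) of maximum size; k is its index
        k = max(range(n), key=lambda i: abs(w[i]))
        beta = w[k]
        # eigenvalue estimate lambda_1^(m) = w_k^(m) / z_k^(m-1)
        estimates.append(w[k] / z[k])
        # z^(m) = w^(m) / beta_m
        z = [wi / beta for wi in w]
    return z, estimates


# Matrix of example (9.2.12); the start vector is an assumption.
A = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
z, estimates = power_method(A, [1.0, 1.0, 1.0], 30)
print(estimates[-1], z)
```

With this matrix the estimates settle near $\lambda_1$ of (9.2.13) within a handful of steps, as the small ratio $|\lambda_2/\lambda_1|$ suggests.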


Acceleration methods Since there is a known regular pattern with which the
error decreases for both $\lambda_1^{(m)}$ and $z^{(m)}$, this can be used to obtain more rapidly
convergent methods. We give three different approaches for accelerating the
convergence.


Case 1. Translation of the Eigenvalues. Choose a constant b, and replace the
calculation of the eigenvalues of A by those of

$$ B = A - bI \tag{9.2.16} $$

The eigenvalues of B are $\lambda_i - b$, i = 1, ..., n. Pick b so that $\lambda_1 - b$ is the
dominant eigenvalue of B, and choose b to minimize the ratio of convergence.

As a particular case in order to be more explicit, suppose that all eigenvalues
of A are real and that they have been so arranged that

$$ \lambda_1 > \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_n \qquad \lambda_1 > |\lambda_n| $$

Then the dominant eigenvalue of B could be either $\lambda_1 - b$ or $\lambda_n - b$, depending
on the size of b. We first require that b satisfy

$$ |\lambda_1 - b| > |\lambda_n - b| $$

The rate of convergence will be

$$ \max \left( \frac{|\lambda_2 - b|}{|\lambda_1 - b|}, \frac{|\lambda_n - b|}{|\lambda_1 - b|} \right) \tag{9.2.17} $$

If we look carefully at the behavior of these two ratios as b varies, we see that the
minimum of (9.2.17) occurs when

$$ \lambda_2 - b = -(\lambda_n - b), \quad \text{that is,} \quad |\lambda_2 - b| = |\lambda_n - b| $$

and

$$ b^* = \tfrac{1}{2}(\lambda_2 + \lambda_n) \tag{9.2.18} $$

is the optimal choice of b. The resulting ratio of convergence is

$$ \left| \frac{\lambda_2 - b^*}{\lambda_1 - b^*} \right| = \left| \frac{\lambda_n - b^*}{\lambda_1 - b^*} \right| = \frac{\lambda_2 - \lambda_n}{2\lambda_1 - \lambda_2 - \lambda_n} \tag{9.2.19} $$

Experimental methods based on this formula and on (9.2.11) can be used to
determine approximate values of $b^*$.

Transformations other than (9.2.16) can be used to transform the set of 
eigenvalues in such a way as to obtain even more rapid convergence. For a 
further discussion of these ideas, see Wilkinson (1965, pp. 570-584). 


Example In the previous example (9.2.12), the theoretical ratio of convergence
was

$$ \frac{\lambda_2}{\lambda_1} \approx -.0648 $$

Using the optimal value $b^*$ given by (9.2.18), and using (9.2.13) in a rearranged
order (so that $\lambda_2 = 0$ and $\lambda_n = -.6234753830$),

$$ b^* = \tfrac{1}{2}(\lambda_2 + \lambda_n) = -.31174 \tag{9.2.20} $$

The eigenvalues of $A - b^*I$ are

$$ 9.93522 \qquad .31174 \qquad -.31174 \tag{9.2.21} $$

The ratio of convergence for the power method applied to $A - b^*I$ will be

$$ \frac{.31174}{9.93522} \approx .0314 $$

which is less than half the magnitude of the original ratio.
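
The arithmetic of (9.2.18) and (9.2.19) is easy to check numerically. The sketch below assumes real eigenvalues that are already known (which in practice they are not; the text points to experimental estimates of $b^*$ instead), and the helper name `optimal_shift` is hypothetical.

```python
def optimal_shift(eigenvalues):
    """Return (b_star, ratio) from (9.2.18) and (9.2.19).

    Assumes real eigenvalues with lambda_1 > lambda_2 >= ... >= lambda_n
    and lambda_1 > |lambda_n|.
    """
    lam = sorted(eigenvalues, reverse=True)
    lam1, lam2, lamn = lam[0], lam[1], lam[-1]
    b_star = 0.5 * (lam2 + lamn)
    ratio = (lam2 - lamn) / (2.0 * lam1 - lam2 - lamn)
    return b_star, ratio


# Eigenvalues (9.2.13) of the example matrix (9.2.12).
b_star, ratio = optimal_shift([9.623475383, -0.6234753830, 0.0])
print(b_star, ratio)
```

The returned ratio can be compared with $|\lambda_2/\lambda_1|$ for the unshifted matrix to see how much the translation helps.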


Case 2. Aitken Extrapolation. The form of convergence in (9.2.11) is completely
analogous to the linearly convergent rootfinding methods of Section 2.5 of
Chapter 2. Following the same development as in Section 2.6, we consider the use
of Aitken extrapolation to accelerate the convergence of $\{\lambda_1^{(m)}\}$ and $\{z^{(m)}\}$. To
use the following development, we must assume in (9.2.1) that

$$ |\lambda_2| > |\lambda_3| \tag{9.2.22} $$

This can be weakened to

$$ |\lambda_2| = |\lambda_j| \quad \text{implies} \quad \lambda_2 = \lambda_j $$

But we do not allow two ratios of convergence of equal magnitude and opposite
sign (see the preceding example of (9.2.21) for such a case). The Aitken procedure
can also be modified so as to remove the restriction (9.2.22).
With (9.2.22), and using (9.2.10),

$$ \lambda_1 - \lambda_1^{(m)} \approx c r^m \tag{9.2.23} $$

where c is some constant, r is the unknown rate of convergence, and theoretically
$r = \lambda_2/\lambda_1$. Proceeding exactly as in Section 2.6, (9.2.23) implies that

$$ \frac{\lambda_1^{(m)} - \lambda_1^{(m-1)}}{\lambda_1^{(m-1)} - \lambda_1^{(m-2)}} \approx r \tag{9.2.24} $$

And Aitken extrapolation gives the improved value $\hat\lambda_1$:

$$ \hat\lambda_1 = \lambda_1^{(m)} - \frac{[\lambda_1^{(m)} - \lambda_1^{(m-1)}]^2}{[\lambda_1^{(m)} - \lambda_1^{(m-1)}] - [\lambda_1^{(m-1)} - \lambda_1^{(m-2)}]}, \qquad m \ge 3 \tag{9.2.25} $$

A similar derivation can be applied to the eigenvector approximants, to accelerate
each component of the sequence $\{z^{(m)}\}$, although some care must be used.


Example Consider again the example (9.2.12). In Table 9.1,

$$ R_5 = -.06474 \qquad \frac{\lambda_2}{\lambda_1} \approx -.06479 $$

As an example of (9.2.25), extrapolate with the values $\lambda_1^{(2)}$, $\lambda_1^{(3)}$, and $\lambda_1^{(4)}$ from that
table. Then

$$ \hat\lambda_1 = 9.6234814 \qquad \lambda_1 - \hat\lambda_1 = -6.03 \times 10^{-6} $$

In comparison using the more accurate table value $\lambda_1^{(5)}$, the error is
$\lambda_1 - \lambda_1^{(5)} = -3.29 \times 10^{-5}$. This again shows the value of using
extrapolation whenever the theory justifies its use.
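
The extrapolation (9.2.25) needs only three consecutive estimates. A minimal sketch follows; the function name `aitken` is hypothetical, and the three inputs are the Table 9.1 values for m = 2, 3, 4.

```python
def aitken(lam_m2, lam_m1, lam_m):
    """Aitken extrapolation (9.2.25) from three consecutive estimates
    lambda^(m-2), lambda^(m-1), lambda^(m)."""
    d1 = lam_m - lam_m1      # lambda^(m)   - lambda^(m-1)
    d0 = lam_m1 - lam_m2     # lambda^(m-1) - lambda^(m-2)
    return lam_m - d1 * d1 / (d1 - d0)


lam_hat = aitken(9.5038496, 9.6313231, 9.6229674)
print(lam_hat)   # approximately 9.6234814, as in the example above
```

The formula breaks down if the two successive differences are equal, so a production version would guard against a zero denominator.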


Case 3. The Rayleigh-Ritz Quotient. Whenever A is symmetric, it is better to
use the following eigenvalue approximations:

$$ \lambda_1^{(m+1)} = \frac{(Az^{(m)}, z^{(m)})}{(z^{(m)}, z^{(m)})} = \frac{(w^{(m+1)}, z^{(m)})}{(z^{(m)}, z^{(m)})}, \qquad m \ge 0 \tag{9.2.26} $$

We are using standard inner product notation:

$$ (w, z) = \sum_{i=1}^{n} w_i z_i, \qquad w, z \in R^n $$

To analyze this sequence (9.2.26), note that all eigenvalues of A are real and
that the eigenvectors $x_1, \ldots, x_n$ can be chosen to be orthonormal. Then (9.2.2),
(9.2.4), (9.2.5), together with (9.2.26), imply

$$ \lambda_1^{(m+1)} = \frac{\sum_{j=1}^{n} |\alpha_j|^2 \lambda_j^{2m+1}}{\sum_{j=1}^{n} |\alpha_j|^2 \lambda_j^{2m}} $$

$$ \lambda_1^{(m+1)} = \lambda_1 \left[ 1 + O\!\left( \left( \frac{\lambda_2}{\lambda_1} \right)^{2m} \right) \right] \tag{9.2.27} $$

The ratio of convergence of $\lambda_1^{(m)}$ to $\lambda_1$ is $(\lambda_2/\lambda_1)^2$, an improvement on the
original ratio in (9.2.11) of $\lambda_2/\lambda_1$.

This is a well-known classical procedure, and it has many additional aspects 
that are of use in some problems. For additional discussion, see Wilkinson (1965, 


pp. 172-178). 


Example In the example (9.2.12), use the approximate eigenvector $z^{(2)}$ in
(9.2.26). Then

$$ \lambda_1^{(3)} = \frac{(Az^{(2)}, z^{(2)})}{(z^{(2)}, z^{(2)})} \approx 9.623464 $$

which is as accurate as the value $\lambda_1^{(5)}$ obtained earlier.
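
The quotient (9.2.26) is a one-line computation once $Az$ is available. The sketch below (the name `rayleigh_quotient` is hypothetical) evaluates it for the five-digit iterate $z^{(2)}$ listed in Table 9.1, so the result is limited by the rounding of that table entry.

```python
def rayleigh_quotient(A, z):
    """Return (Az, z) / (z, z) for a real symmetric A, as in (9.2.26)."""
    n = len(A)
    Az = [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(Az, z)) / sum(b * b for b in z)


A = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]   # matrix (9.2.12)
z2 = [0.52626, 0.76313, 1.0]                              # z^(2) from Table 9.1
print(rayleigh_quotient(A, z2))
```

Because the error of the quotient is quadratic in the eigenvector error, even this early iterate gives roughly six correct digits.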


The power method can be used when there is not a single dominant eigen- 
value, but the algorithm is more complicated. The power method can also be used 
to determine eigenvalues other than the dominant one. This involves a process 
called deflation of A to remove $\lambda_1$ as an eigenvalue. For a complete discussion of
all aspects of the power method, see Golub and Van Loan (1983, pp. 208-218) and
Wilkinson (1965, chap. 9). Although it is a useful method in some circumstances, 
it should be stressed that the methods of the following sections are usually more 
efficient. For a rapidly convergent variation on the power method and the 
Rayleigh—Ritz quotient, see the Rayleigh quotient iteration for symmetric matrices 
in Parlett (1980, p. 70). 


9.3 Orthogonal Transformations Using 
Householder Matrices 


As one step in finding the eigenvalues of a matrix, it is often reduced to a simpler 
form using similarity transformations. Orthogonal matrices will be the class of 
matrices we use for these transformations. It was shown in (9.1.39) that orthogo- 
nal transformations will not worsen the condition or stability of the eigenvalues 
of a nonsymmetric matrix. Also, orthogonal matrices have other desirable error 
propagation properties, an example of which is given later in the section. For 
these reasons, we restrict our transformations to those using orthogonal matrices. 

We begin the section by looking at a special class of orthogonal matrices 
known as Householder matrices. Then we show how to construct a Householder 
matrix that will transform a given vector to a simpler form. With this construction
as a tool, we look at two transformations of a given matrix A: (1) obtain its
QR factorization, and (2) construct a similar tridiagonal matrix when A is a
symmetric matrix. These forms are used in the next two sections in the calcula- 
tion of the eigenvalues of A. As a matter of notation, note that we should be 
restricting the use of the term orthogonal to real matrices. But it has become 
common usage in this area to use orthogonal rather than unitary for the general 
complex case, and we will adopt the same convention. The reader should 
understand unitary when orthogonal is used for a complex matrix. 
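Let $w \in C^n$ with $\|w\|_2 = \sqrt{w^*w} = 1$. Define

$$ U = I - 2ww^* \tag{9.3.1} $$

This is the general form of a Householder matrix.

Example For n = 3, we require

$$ w = [w_1, w_2, w_3]^T \qquad |w_1|^2 + |w_2|^2 + |w_3|^2 = 1 $$

The matrix U is given by

$$ U = \begin{bmatrix} 1 - 2|w_1|^2 & -2w_1\bar w_2 & -2w_1\bar w_3 \\ -2\bar w_1 w_2 & 1 - 2|w_2|^2 & -2w_2\bar w_3 \\ -2\bar w_1 w_3 & -2\bar w_2 w_3 & 1 - 2|w_3|^2 \end{bmatrix} $$

For the particular case

$$ w = \left[ \tfrac{1}{3}, \tfrac{2}{3}, \tfrac{2}{3} \right]^T $$

we have

$$ U = \frac{1}{9} \begin{bmatrix} 7 & -4 & -4 \\ -4 & 1 & -8 \\ -4 & -8 & 1 \end{bmatrix} $$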


We first prove U is Hermitian and orthogonal. To show it is Hermitian,

$$ U^* = (I - 2ww^*)^* = I^* - 2(ww^*)^* = I - 2(w^*)^*w^* = I - 2ww^* = U $$

To show it is orthogonal,

$$ U^*U = U^2 = (I - 2ww^*)^2 = I - 4ww^* + 4(ww^*)(ww^*) = I $$

since using the associative law and $w^*w = 1$ implies

$$ (ww^*)(ww^*) = w(w^*w)w^* = ww^* $$


The matrix U of the preceding example illustrates these properties. In Problem 
12, we give a geometric meaning to the linear function T(x) = Ux for U a 
Householder matrix. 

We will usually use vectors w with leading zero components: 


$$ w = [0, \ldots, 0, w_r, \ldots, w_n]^T = [0_{r-1}^T, \hat w^T]^T \tag{9.3.2} $$

with $\hat w \in C^{n-r+1}$. Then

$$ U = \begin{bmatrix} I_{r-1} & 0 \\ 0 & I_{n-r+1} - 2\hat w \hat w^* \end{bmatrix} \tag{9.3.3} $$

Premultiplication of a matrix A by this U will leave the first r - 1 rows of A
unchanged, and postmultiplication of A will leave its first r - 1 columns
unchanged. For the remainder of this section we assume all matrices and vectors are
real, in order to avoid having to deal with possible complex values for w.

The Householder matrices are used to transform a nonzero vector into a new 
vector containing mainly zeros. Let $b \ne 0$ be given, $b \in R^n$, and suppose we
want to produce U of form (9.3.1) such that Ub contains zeros in positions r + 1
through n, for some given $r \ge 1$. Choose w as in (9.3.2). Then the first r - 1
elements of b and Ub are the same.

To simplify the later work, write m = n - r + 1,

$$ w = \begin{bmatrix} 0_{r-1} \\ v \end{bmatrix} \qquad b = \begin{bmatrix} c \\ d \end{bmatrix} $$

with $c \in R^{r-1}$, $v, d \in R^m$. Then our restriction on the form of Ub requires the
first r - 1 components of Ub to be c, and

$$ (I - 2vv^T)d = [\alpha, 0, \ldots, 0]^T \qquad \|v\|_2 = 1 \tag{9.3.4} $$

for some $\alpha$. Since $I - 2vv^T$ is orthogonal, the length of d is preserved (Problem
13 of Chapter 7); and thus

$$ |\alpha| = \|d\|_2 \equiv S \qquad \alpha = \pm S = \pm\sqrt{d_1^2 + \cdots + d_m^2} \tag{9.3.5} $$

Define

$$ p = v^Td $$

From (9.3.4),

$$ d - 2pv = [\alpha, 0, \ldots, 0]^T \tag{9.3.6} $$

Multiplication by $v^T$ and use of $\|v\|_2 = 1$ implies

$$ p = -\alpha v_1 \tag{9.3.7} $$

Substituting this into the first component of (9.3.6) gives

$$ d_1 + 2\alpha v_1^2 = \alpha \qquad v_1^2 = \frac{1}{2}\left( 1 - \frac{d_1}{\alpha} \right) \tag{9.3.8} $$

Choose the sign of $\alpha$ in (9.3.5) by

$$ \operatorname{sign}(\alpha) = -\operatorname{sign}(d_1) \tag{9.3.9} $$

This choice maximizes $v_1^2$, and it avoids any possible loss of significance errors in
the calculation of $v_1$. The sign for $v_1$ is irrelevant. Having $v_1$, obtain p from
(9.3.7). Return to (9.3.6), and then using components 2 through m,

$$ v_j = \frac{d_j}{2p}, \qquad j = 2, 3, \ldots, m \tag{9.3.10} $$

The statements (9.3.5), (9.3.7)-(9.3.10) completely define v, and thus w and U.
The operation count is 2m + 2 multiplications and divisions, and two square
roots. The square root defining $v_1$ can be avoided in practice, because it will
disappear when the matrix $ww^T$ is formed. A sequence of such transformations
of vectors b will be used to systematically reduce matrices to simpler forms.


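Example Consider the given vector

$$ b = [2, 2, 1]^T $$

We calculate a matrix U for which Ub will have zeros in its last two positions. To
help in following the construction, some of the intermediate calculations are
listed. Note that w = v and b = d for this case. Then

$$ \alpha = -3 \qquad v_1 = \sqrt{\tfrac{5}{6}} \qquad p = \sqrt{\tfrac{15}{2}} \qquad v_2 = \frac{2}{\sqrt{30}} \qquad v_3 = \frac{1}{\sqrt{30}} $$

The matrix U is given by

$$ U = \begin{bmatrix} -\tfrac{2}{3} & -\tfrac{2}{3} & -\tfrac{1}{3} \\ -\tfrac{2}{3} & \tfrac{11}{15} & -\tfrac{2}{15} \\ -\tfrac{1}{3} & -\tfrac{2}{15} & \tfrac{14}{15} \end{bmatrix} $$

and

$$ Ub = [-3, 0, 0]^T $$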

The QR factorization of a matrix Given a real matrix A, we show there is an
orthogonal matrix Q and an upper triangular matrix R for which

$$ A = QR \tag{9.3.11} $$

Let

$$ P_r = I - 2w^{(r)}w^{(r)T}, \qquad r = 1, \ldots, n-1 \tag{9.3.12} $$

with $w^{(r)}$ as in (9.3.2) with r - 1 leading zeros. Writing A in terms of its columns
$A_{*1}, \ldots, A_{*n}$, we have

$$ P_1A = [P_1A_{*1}, \ldots, P_1A_{*n}] $$

Pick $P_1$ and $w^{(1)}$ using the preceding construction (9.3.5)-(9.3.10) with $b = A_{*1}$.
Then $P_1A$ contains zeros below the diagonal in its first column.

Choose $P_2$ similarly, so that $P_2P_1A$ will contain zeros in its second column
below the diagonal. First note that because $w^{(2)}$ contains a zero in position 1, and
because $P_1A$ is zero in the first column below position 1, the products $P_2P_1A$ and
$P_1A$ contain the same elements in row one and column one. Now choose $P_2$
and $w^{(2)}$ as before in (9.3.5)-(9.3.10), with b equal to the second column of $P_1A$.

By carrying this out with each column of A, we obtain an upper triangular
matrix

$$ R = P_{n-1} \cdots P_1A \tag{9.3.13} $$

If at step r of the construction, all elements below the diagonal of column r are
zero, then just choose $P_r = I$ and go on to the next step. To complete the
construction, define

$$ Q^T = P_{n-1} \cdots P_1 \tag{9.3.14} $$

which is orthogonal. Then A = QR, as desired.


Example Consider

$$ A = \begin{bmatrix} 4 & 1 & 1 \\ 1 & 4 & 1 \\ 1 & 1 & 4 \end{bmatrix} $$

Then

$$ w^{(1)} = [.985599, .119573, .119573]^T $$

$$ A_1 = P_1A = \begin{bmatrix} -4.24264 & -2.12132 & -2.12132 \\ 0 & 3.62132 & .621321 \\ 0 & .621321 & 3.62132 \end{bmatrix} $$

$$ w^{(2)} = [0, .996393, .0848572]^T $$

$$ R = P_2A_1 = \begin{bmatrix} -4.24264 & -2.12132 & -2.12132 \\ 0 & -3.67423 & -1.22475 \\ 0 & 0 & 3.46410 \end{bmatrix} $$

For the factorization A = QR, evaluate $Q = P_1P_2$. But in most applications, it
would be inefficient to explicitly produce Q. We comment further on this shortly.

Since Q orthogonal implies $\det(Q) = \pm 1$, we have

$$ |\det(A)| = |\det(Q)\det(R)| = |\det(R)| = 53.9999 $$

This number is consistent with the fact that the eigenvalues of A are $\lambda = 3, 3, 6$,
and their product is det(A) = 54.
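
The column-by-column reduction to R can be sketched as follows. This is an illustrative sketch, not a library routine: the name `householder_qr` is hypothetical, it returns the vectors $w^{(r)}$ rather than Q, and each $P_r$ is applied through inner products instead of being formed as a matrix. The 3 x 3 matrix used is the symmetric matrix with 4 on the diagonal and 1 elsewhere, which has eigenvalues 3, 3, 6 as stated in the example.

```python
import math


def householder_qr(A):
    """Return (ws, R) with R = P_{n-1} ... P_1 A and P_r = I - 2 w w^T."""
    n = len(A)
    R = [row[:] for row in A]
    ws = []
    for r in range(n - 1):
        d = [R[i][r] for i in range(r, n)]
        w = [0.0] * n                       # w = 0 gives P_r = I
        if any(x != 0.0 for x in d[1:]):
            S = math.sqrt(sum(x * x for x in d))
            alpha = -S if d[0] >= 0 else S
            v1 = math.sqrt(0.5 * (1.0 - d[0] / alpha))
            p = -alpha * v1
            w[r] = v1
            for i in range(r + 1, n):
                w[i] = R[i][r] / (2.0 * p)
            # Apply P_r column by column: a <- a - 2 (w^T a) w
            for j in range(n):
                s = 2.0 * sum(w[i] * R[i][j] for i in range(r, n))
                for i in range(r, n):
                    R[i][j] -= s * w[i]
            for i in range(r + 1, n):
                R[i][r] = 0.0               # clear rounding residue
        ws.append(w)
    return ws, R


ws, R = householder_qr([[4.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 1.0, 4.0]])
print(ws[0], R)
```

Keeping only the vectors $w^{(r)}$ is usually enough, since a product with Q or $Q^T$ can be formed by applying the reflections one at a time.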


Discussion of the QR factorization It is useful to know to what extent the
factorization A = QR is unique. For A nonsingular, suppose

$$ A = Q_1R_1 = Q_2R_2 \tag{9.3.15} $$

Then $R_1$ and $R_2$ must also be nonsingular, and

$$ Q_2^TQ_1 = R_2R_1^{-1} $$

The inverse of an upper triangular matrix is upper triangular, and the product of
two upper triangular matrices is upper triangular. Thus $R_2R_1^{-1}$ is upper
triangular. Also, the product of two orthogonal matrices is orthogonal; thus, the product
$Q_2^TQ_1$ is orthogonal. This says $R_2R_1^{-1}$ is orthogonal. But it is not hard to show
that the only upper triangular orthogonal matrices are the diagonal matrices. For
some diagonal matrix D,

$$ R_2R_1^{-1} = D $$

Since $R_2R_1^{-1}$ is orthogonal,

$$ D^2 = I $$

Since we are only dealing with real matrices, D has diagonal elements equal to
+1 or -1. Combining these results,

$$ Q_1 = Q_2D \qquad R_2 = DR_1 \tag{9.3.16} $$

This says the signs of the diagonal elements of R in A = QR can be chosen
arbitrarily, but then the rest of the decomposition is uniquely determined.

Another practical matter is deciding how to evaluate the matrix R of (9.3.13).
Let

$$ A_r = P_rA_{r-1} = [I - 2w^{(r)}w^{(r)T}]A_{r-1}, \qquad r = 1, 2, \ldots, n-1 \tag{9.3.17} $$

with $A_0 = A$, $A_{n-1} = R$. If we calculate $P_r$ and then multiply it times $A_{r-1}$ to
form $A_r$, the number of multiplications will be

$$ (n - r + 1)^3 + \tfrac{1}{2}(n - r + 2)(n - r + 1) $$

There is a much more efficient method for calculating $A_r$. Rewrite (9.3.17) as

$$ A_r = A_{r-1} - 2w^{(r)}[w^{(r)T}A_{r-1}] \tag{9.3.18} $$

First calculate $w^{(r)T}A_{r-1}$, and then calculate $w^{(r)}[w^{(r)T}A_{r-1}]$ and $A_r$. This
requires about

$$ 2(n - r)(n - r + 1) + (n - r + 1) \tag{9.3.19} $$

multiplications, which shows (9.3.18) is a preferable way to evaluate each $A_r$ and
finally $R = A_{n-1}$. This does not include the cost of obtaining $w^{(r)}$, which was
discussed earlier, following (9.3.10).

If it is necessary to store the matrices $P_1, \ldots, P_{n-1}$ for later use, just store each
column $w^{(r)}$, r = 1, ..., n - 1. Save the first nonzero element of $w^{(r)}$, the one in
position r, in a special storage location, and save the remaining nonzero elements
of $w^{(r)}$, those in positions r + 1 through n, in column r of the matrix $A_r$ and R,
below the diagonal. The matrix Q of (9.3.14) could be produced explicitly. But as
the construction (9.3.18) shows, we do not need Q explicitly in order to multiply
it times some other matrix.

The main use of the QR factorization of A will be in defining the QR method 
for calculating the eigenvalues of A, which is presented in Section 9.5. The 
factorization can also be used to solve a linear system of equations Ax = b. The 
factorization leads directly to the equivalent system $Rx = Q^Tb$, and very little
error is introduced because Q is orthogonal. The system $Rx = Q^Tb$ is upper
triangular, and it can be solved in a stable manner using back substitution. For A
an ill-conditioned matrix, this may be a superior way to solve the linear system
Ax = b. For a discussion of the errors involved in obtaining and using the QR
factorization and for a comparison of it and Gaussian elimination for solving 
Ax = b, see Wilkinson (1965, pp. 236, 244-249). We pursue this topic further in 
Section 9.7, when we discuss the least squares solution of overdetermined linear 
systems. 


The transformation of a symmetric matrix to tridiagonal form Let A be a real
symmetric matrix. To find the eigenvalues of A, it is usually first reduced to 
tridiagonal form by orthogonal similarity transformations. The eigenvalues of the 
tridiagonal matrix are then calculated using the theory of Sturm sequences, 
presented in Section 9.4, or the QR method, presented in Section 9.5. For the 
orthogonal matrices, we use the Householder matrices of (9.3.3). 

Let

$$ P_r = I - 2w^{(r+1)}w^{(r+1)T}, \qquad r = 1, \ldots, n-2 \tag{9.3.20} $$

with $w^{(r+1)}$ defined as in (9.3.2):

$$ w^{(r+1)} = [0, \ldots, 0, w_{r+1}, \ldots, w_n]^T $$

[Note the change in notation from that of the $P_r$ of (9.3.12) used in defining the
QR factorization.] The matrix

$$ A_1 = P_1^TAP_1 = P_1AP_1 $$

is similar to A, the element $a_{11}$ is unchanged, and $A_1$ will be symmetric. Produce
$w^{(2)}$ and $P_1$ to obtain the form

$$ P_1A_{*1} = [a_{11}, \hat a_{21}, 0, \ldots, 0]^T $$

for some $\hat a_{21}$. The vector $A_{*1}$ is the first column of A. Use (9.3.5)-(9.3.10) with
m = n - 1 and

$$ d = [a_{21}, a_{31}, \ldots, a_{n1}]^T $$

For example, from (9.3.5) and (9.3.9),

$$ \alpha = -\operatorname{sign}(a_{21})\sqrt{a_{21}^2 + \cdots + a_{n1}^2} $$


Having obtained $P_1$ and $P_1A$, postmultiplication by $P_1$ will not change the
first column of $P_1A$. (This should be checked by the reader.) The symmetry of $A_1$
follows from

$$ A_1^T = (P_1AP_1)^T = P_1^TA^TP_1^T = P_1AP_1 = A_1 $$

Since $A_1$ is symmetric, the construction on the first column of A will imply that
$A_1$ has zeros in positions 3 through n of both the first row and column.
Continue this process, letting

$$ A_r = P_r^TA_{r-1}P_r, \qquad r = 1, 2, \ldots, n-2 \tag{9.3.21} $$

with $A_0 = A$. Pick $P_r$ to introduce zeros into positions r + 2 through n of
column r. Columns 1 through r - 1 will remain undisturbed in calculating
$P_rA_{r-1}$, due to the special form of $P_r$. Pick the vector $w^{(r+1)}$ in analogy with the
preceding description for $w^{(2)}$.

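The final matrix $T = A_{n-2}$ is tridiagonal and symmetric.

$$ T = \begin{bmatrix} \alpha_1 & \beta_1 & 0 & \cdots & 0 \\ \beta_1 & \alpha_2 & \beta_2 & & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & & \beta_{n-2} & \alpha_{n-1} & \beta_{n-1} \\ 0 & \cdots & 0 & \beta_{n-1} & \alpha_n \end{bmatrix} \tag{9.3.22} $$

This will be a much more convenient form for the calculation of the eigenvalues
of A, and the eigenvectors of A can easily be obtained from those of T.
T is related to A by

$$ T = Q^TAQ \qquad Q = P_1 \cdots P_{n-2} \tag{9.3.23} $$

As before with the QR factorization, we seldom produce Q explicitly, preferring
to work with the individual matrices $P_r$ in analogy with (9.3.18). For an
eigenvector x of A, say $Ax = \lambda x$, we have

$$ Tz = \lambda z \qquad x = Qz \tag{9.3.24} $$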
If we produce an orthonormal set of eigenvectors $\{z_i\}$ for T, then $\{Qz_i\}$ will be
an orthonormal set of eigenvectors for A, since Q preserves length and angles
(see Problem 13 of Chapter 7).


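Example Let

$$ A = \begin{bmatrix} 1 & 3 & 4 \\ 3 & 1 & 2 \\ 4 & 2 & 1 \end{bmatrix} $$

Then

$$ w^{(2)} = \left[ 0, \frac{2}{\sqrt 5}, \frac{1}{\sqrt 5} \right]^T $$

$$ P_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & -\tfrac{3}{5} & -\tfrac{4}{5} \\ 0 & -\tfrac{4}{5} & \tfrac{3}{5} \end{bmatrix} $$

$$ T = P_1^TAP_1 = \begin{bmatrix} 1 & -5 & 0 \\ -5 & \tfrac{73}{25} & \tfrac{14}{25} \\ 0 & \tfrac{14}{25} & -\tfrac{23}{25} \end{bmatrix} $$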
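
The reduction (9.3.21) can be sketched in the same style. The function below (the name `tridiagonalize` is hypothetical) applies each $P_r$ from the left and then from the right using inner products only; it does not exploit symmetry to save work, which a production routine would. The test matrix is the symmetric 3 x 3 matrix with rows (1, 3, 4), (3, 1, 2), (4, 2, 1), which is consistent with the $w^{(2)}$, $P_1$, and T quoted in the example.

```python
import math


def tridiagonalize(A):
    """Householder reduction of a real symmetric matrix to tridiagonal form."""
    n = len(A)
    T = [row[:] for row in A]
    for r in range(n - 2):
        d = [T[i][r] for i in range(r + 1, n)]
        if all(x == 0.0 for x in d[1:]):
            continue                          # nothing to eliminate: P_r = I
        S = math.sqrt(sum(x * x for x in d))
        alpha = -S if d[0] >= 0 else S
        v1 = math.sqrt(0.5 * (1.0 - d[0] / alpha))
        p = -alpha * v1
        w = [0.0] * n
        w[r + 1] = v1
        for i in range(r + 2, n):
            w[i] = T[i][r] / (2.0 * p)
        # T <- P T, one column at a time
        for j in range(n):
            s = 2.0 * sum(w[i] * T[i][j] for i in range(n))
            for i in range(n):
                T[i][j] -= s * w[i]
        # T <- T P, one row at a time
        for i in range(n):
            s = 2.0 * sum(T[i][j] * w[j] for j in range(n))
            for j in range(n):
                T[i][j] -= s * w[j]
    return T


T = tridiagonalize([[1.0, 3.0, 4.0], [3.0, 1.0, 2.0], [4.0, 2.0, 1.0]])
print(T)
```

Entries that should vanish are left as small rounding residues here; a practical code would set them to zero and store only the two diagonals.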


For an error analysis of this reduction to tridiagonal form, we give some
results from Wilkinson (1965). Let the computer arithmetic be binary floating-point
with rounding, with t binary digits in the mantissa. Furthermore, assume
that all inner products that occur in the calculation are accumulated in double
precision, with rounding to single precision at the completion of the summation.
These inner products occur in a variety of places in the computation of T from A.
Let $\hat T$ denote the actual symmetric tridiagonal matrix that is computed from A
using the preceding computer arithmetic. Let $\hat P_r$ denote the actual matrix produced
in converting $A_{r-1}$ to $A_r$, let $P_r$ be the theoretically exact version of this matrix if
no rounding errors occurred, and let $Q = P_1 \cdots P_{n-2}$ be the exact product of these
$P_r$, an orthogonal matrix.


Theorem 9.4 Let A be a real symmetric matrix of order n. Let $\hat T$ be the real
symmetric tridiagonal matrix resulting from applying the Householder
similarity transformations (9.3.20) to A, as in (9.3.21).
Assume the floating-point arithmetic used has the characteristics
described in the preceding paragraph. Let $\{\lambda_i\}$ and $\{\tau_i\}$ be the
eigenvalues of A and $\hat T$, respectively, arranged in increasing order.
Then

$$ \left[ \frac{\sum_{i=1}^{n} (\tau_i - \lambda_i)^2}{\sum_{i=1}^{n} \lambda_i^2} \right]^{1/2} \le c_n 2^{-t} \tag{9.3.25} $$

with

$$ c_n = 25(n - 1)\left[ 1 + (12.36)2^{-t} \right]^{2n-4} $$

For small and moderate values of n, $c_n \approx 25(n - 1)$.


Proof From Wilkinson (1965, p. 161) using the Frobenius matrix norm F,

$$ F(\hat T - Q^TAQ) \le 2x(n - 1)(1 + x)^{2n-4}F(A) \tag{9.3.26} $$

with $x = (12.36)2^{-t}$. From the Wielandt-Hoffman result (9.1.19) of
Theorem 9.3, we have

$$ \left[ \sum_{i=1}^{n} (\tau_i - \lambda_i)^2 \right]^{1/2} \le F(\hat T - Q^TAQ) \tag{9.3.27} $$

since A and $Q^TAQ$ have the same eigenvalues. And from Problem 28(b)
of Chapter 7,

$$ F(A) = \left[ \sum_{i=1}^{n} \lambda_i^2 \right]^{1/2} $$

Combining these results yields (9.3.25).

For a further discussion of the error, including the case in which inner
products are not accumulated in double precision, see Wilkinson (1965, pp.
297-299). The result (9.3.25) shows that the reduction to tridiagonal form is an
extremely stable operation, with little new error introduced for the eigenvalues.


Planar rotation orthogonal matrices There are other classes of orthogonal
matrices that can be used in place of the Householder matrices. The principal
class is the set of plane rotations, which can be given the geometric interpretation
of rotating a pair of coordinate axes through a given angle $\theta$ in the plane of the
axes. For integers k, l, $1 \le k < l \le n$, define the n x n orthogonal matrix $R^{(k,l)}$
by altering four elements of the identity matrix $I_n$. For any real number $\theta$, define
the elements of $R^{(k,l)}$ by

$$ R^{(k,l)}_{ij} = \begin{cases} \cos\theta & (i, j) = (k, k) \text{ or } (l, l) \\ \sin\theta & (i, j) = (k, l) \\ -\sin\theta & (i, j) = (l, k) \\ (I_n)_{ij} & \text{all other } (i, j) \end{cases} \tag{9.3.28} $$

for $1 \le i, j \le n$.

Example For n = 3,

$$ R^{(1,3)} = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} $$

As a particular case, take $\theta = \pi/4$. Then

$$ R^{(1,3)} = \begin{bmatrix} \tfrac{1}{\sqrt 2} & 0 & \tfrac{1}{\sqrt 2} \\ 0 & 1 & 0 \\ -\tfrac{1}{\sqrt 2} & 0 & \tfrac{1}{\sqrt 2} \end{bmatrix} $$

The plane rotations $R^{(k,l)}$ can be used to accomplish the same reductions for
which the Householder matrices are used. In most situations, the Householder 
matrices are more efficient, but the plane rotations are more efficient for part of 
the QR method of Section 9.5. The idea of solving the symmetric matrix 
eigenvalue problem by first reducing it to tridiagonal form is due to W. Givens in 
1954. He also proposed the use of the techniques of the next section for the 
calculation of the eigenvalues of the tridiagonal matrix. Givens used the plane 
rotations $R^{(k,l)}$, and the Householder matrices were introduced in 1958 by A.
Householder. For additional discussion of rotation matrices and their properties 
and uses, see Golub and Van Loan (1983, sec. 3.4), Parlett (1980, sec. 6.4), 
Wilkinson (1965, p. 131), and Problems 15 and 17(b). 


9.4 The Eigenvalues of a Symmetric Tridiagonal Matrix 


Let T be a real symmetric tridiagonal matrix of order n, as in (9.3.22). We 
compute the characteristic polynomial of T and use it to calculate the eigenvalues 
of T. 

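To compute

$$ f_n(\lambda) = \det(T - \lambda I) \tag{9.4.1} $$

introduce the sequence

$$ f_k(\lambda) = \det \begin{bmatrix} \alpha_1 - \lambda & \beta_1 & 0 & \cdots & 0 \\ \beta_1 & \alpha_2 - \lambda & \beta_2 & & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & & & & \beta_{k-1} \\ 0 & \cdots & 0 & \beta_{k-1} & \alpha_k - \lambda \end{bmatrix} \tag{9.4.2} $$

for $1 \le k \le n$, and $f_0(\lambda) \equiv 1$. By direct evaluation,

$$ f_1(\lambda) = \alpha_1 - \lambda $$

$$ f_2(\lambda) = (\alpha_1 - \lambda)(\alpha_2 - \lambda) - \beta_1^2 = (\alpha_2 - \lambda)f_1(\lambda) - \beta_1^2f_0(\lambda) $$

The formula for $f_2(\lambda)$ illustrates the general triple recursion relation that the
sequence $\{f_k(\lambda)\}$ satisfies:

$$ f_k(\lambda) = (\alpha_k - \lambda)f_{k-1}(\lambda) - \beta_{k-1}^2f_{k-2}(\lambda), \qquad 2 \le k \le n \tag{9.4.3} $$

To prove this, expand the determinant (9.4.2) in its last row using minors and the
result will follow easily. This method for evaluating $f_n(\lambda)$ will require 2n - 3
multiplications, once the coefficients $\{\beta_k^2\}$ have been evaluated.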


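Example Let

$$ T = \begin{bmatrix} 2 & 1 & 0 & 0 & 0 & 0 \\ 1 & 2 & 1 & 0 & 0 & 0 \\ 0 & 1 & 2 & 1 & 0 & 0 \\ 0 & 0 & 1 & 2 & 1 & 0 \\ 0 & 0 & 0 & 1 & 2 & 1 \\ 0 & 0 & 0 & 0 & 1 & 2 \end{bmatrix} \tag{9.4.4} $$

Then

$$ f_0(\lambda) = 1 \qquad f_1(\lambda) = 2 - \lambda $$

$$ f_j(\lambda) = (2 - \lambda)f_{j-1}(\lambda) - f_{j-2}(\lambda), \qquad j = 2, 3, 4, 5, 6 \tag{9.4.5} $$

Without the triple recursion relation (9.4.5), the evaluation of $f_6(\lambda)$ would be
much more complicated.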


At this point, we might consider the problem as solved since $f_n(\lambda)$ is a
polynomial and there are many polynomial rootfinding methods. Or we might
use a more general method, such as the secant method or Brent's method, both
described in Chapter 2. But the sequence $\{f_k(\lambda) \mid 0 \le k \le n\}$ has special
properties that make it a Sturm sequence, and these properties make it comparatively
easy to isolate the eigenvalues of T. Once the eigenvalues have been isolated, a
method such as Brent's method [see Section 2.8] can be used to rapidly calculate
the roots. The theory of Sturm sequences is discussed in Henrici (1974, p. 444),
but we only consider the special case of $\{f_k(\lambda)\}$.

Before stating the consequences of the Sturm theory for $\{f_k(\lambda)\}$, we consider
what happens when some $\beta_l = 0$. Then the eigenvalue problem can be broken
apart into two smaller eigenvalue problems of orders l and n - l. As an example,
consider


$$ T = \begin{bmatrix} \alpha_1 & \beta_1 & 0 & 0 & 0 \\ \beta_1 & \alpha_2 & 0 & 0 & 0 \\ 0 & 0 & \alpha_3 & \beta_3 & 0 \\ 0 & 0 & \beta_3 & \alpha_4 & \beta_4 \\ 0 & 0 & 0 & \beta_4 & \alpha_5 \end{bmatrix} $$

Define $T_1$ and $T_2$ as the two blocks along the diagonal, of orders 2 and 3,
respectively, and then

$$ T = \begin{bmatrix} T_1 & 0 \\ 0 & T_2 \end{bmatrix} $$

From this,

$$ \det[T - \lambda I_5] = \det[T_1 - \lambda I_2]\det[T_2 - \lambda I_3] $$

and we can find the eigenvalues of T by finding those of $T_1$ and $T_2$. The
eigenvector problem also can be solved in the same way. For example, if
$T_1\hat x = \lambda\hat x$, with $\hat x \ne 0$ in $R^2$, define

$$ x = [\hat x^T, 0, 0, 0]^T $$

Then $Tx = \lambda x$. This construction can be used to calculate a complete set of
eigenvectors for T from those for $T_1$ and $T_2$. For the remainder of the section, we
assume that all $\beta_l \ne 0$ in the matrix T. Under this assumption, all eigenvalues of
T will be simple roots of $f_n(\lambda)$.


The Sturm sequence property of $\{f_k(\lambda)\}$ The sequences $\{f_k(a)\}$ and $\{f_k(b)\}$
can be used to determine the number of roots of $f_n(\lambda)$ that are contained in
[a, b]. To do this, introduce the following integer-valued function $s(\lambda)$. Define
$s(\lambda)$ to be the number of agreements in sign of consecutive members of the
sequence $\{f_k(\lambda)\}$, and if the value of some member $f_j(\lambda) = 0$, let its sign be
chosen opposite to that of $f_{j-1}(\lambda)$. It can be shown that $f_j(\lambda) = 0$ implies
$f_{j-1}(\lambda) \ne 0$.


Example Consider the sequence $f_0(\lambda), \ldots, f_6(\lambda)$ given in (9.4.5) of the last
example. For $\lambda = 3$,

$$ (f_0(\lambda), \ldots, f_6(\lambda)) = (1, -1, 0, 1, -1, 0, 1) $$

The corresponding sequence of signs is

$$ (+, -, +, +, -, +, +) $$

and s(3) = 2.

We now state the basic result used in computing the roots of $f_n(\lambda)$ and thus
the eigenvalues of T. The proof follows from the general theory given in Henrici
(1974).


Theorem 9.5 Let T be a real symmetric tridiagonal matrix of order n, as given
in (9.3.22). Let the sequence $\{f_k(\lambda) \mid 0 \le k \le n\}$ be defined as in
(9.4.2), and assume all $\beta_l \ne 0$, l = 1, ..., n - 1. Then the number
of roots of $f_n(\lambda)$ that are greater than $\lambda = a$ is given by s(a),
which is defined in the preceding paragraph. For a < b, the
number of roots in the interval $a < \lambda \le b$ is given by s(a) - s(b).


Calculation of the eigenvalues Theorem 9.5 will be the basic tool in locating
and separating the roots of $f_n(\lambda)$. To begin, calculate an interval that contains
the roots. Using the Gerschgorin circle Theorem 9.1, all eigenvalues are contained
in the interval [a, b], with

$$ a = \min_{1 \le i \le n} \{\alpha_i - |\beta_i| - |\beta_{i-1}|\} $$

$$ b = \max_{1 \le i \le n} \{\alpha_i + |\beta_i| + |\beta_{i-1}|\} $$

where $\beta_0 = \beta_n = 0$.

We use the bisection method on [a, b] to divide it into smaller subintervals.
Theorem 9.5 is used to determine how many roots are contained in a subinterval,
and we seek to obtain subintervals that will each contain one root. If some
eigenvalues are nearly equal, then we continue subdividing until the root is found
with sufficient accuracy. Once a subinterval is known to contain a single root, we
can switch to a more rapidly convergent method.


Example Consider further the example (9.4.4). By the Gerschgorin Theorem
9.1, all eigenvalues lie in [0, 4]. And it is easily checked that neither $\lambda = 0$ nor
$\lambda = 4$ is an eigenvalue. A systematic bisection process was carried out on [0, 4] to
separate the six roots of $f_6(\lambda)$ into six separate subintervals. The results are
shown in Table 9.2 in the order they were calculated. The roots are labeled as
follows:

$$ 0 < \lambda_1 < \lambda_2 < \cdots < \lambda_6 < 4 $$

Table 9.2 Example of use of Theorem 9.5

$\lambda$     $f_6(\lambda)$      $s(\lambda)$   Comment
0.0    7.0          6      $\lambda_1 > 0$
4.0    7.0          0      $\lambda_6 < 4$
2.0   -1.0          3      $\lambda_3 < 2 < \lambda_4$
1.0    1.0          4      $\lambda_2 < 1 < \lambda_3 < 2$
0.5   -1.421875     5      $0 < \lambda_1 < 0.5 < \lambda_2 < 1$
3.0    1.0          2      $2 < \lambda_4 < 3 < \lambda_5$
3.5   -1.421875     1      $3 < \lambda_5 < 3.5 < \lambda_6 < 4$

The roots can be found by continuing with the bisection method, although
Theorem 9.5 is no longer needed. But it would be better to use some other
rootfinding method.

Although all roots of a tridiagonal matrix may be found by this technique, it is
generally faster in that case to use the QR algorithm of the next section. With
large matrices, we usually do not want all of the roots, in which case the methods
of this section are preferable. If we want only certain specific roots, for example,
the five largest or all roots in a given interval such as [1, 3], then it is easy to
locate them using Theorem 9.5.
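
Bisection driven by the count $s(\lambda)$ can be sketched as follows. This is an illustration under stated assumptions rather than a production routine: the function names are hypothetical, all $\beta_l$ are assumed nonzero, plain bisection is used all the way instead of switching to a faster rootfinder, and eigenvalues are indexed from the largest down (k = 1 is the largest), since $s(\lambda)$ counts roots greater than $\lambda$.

```python
def count_above(alpha, beta, lam):
    """s(lam) of Theorem 9.5: the number of eigenvalues greater than lam."""
    f_prev, f_cur = 1.0, alpha[0] - lam
    sign_prev, count = 1, 0
    n = len(alpha)
    for k in range(n):
        if f_cur == 0.0:
            sign_cur = -sign_prev          # zero takes the opposite sign
        else:
            sign_cur = 1 if f_cur > 0 else -1
        if sign_cur == sign_prev:
            count += 1
        if k + 1 < n:                      # recursion (9.4.3)
            f_next = (alpha[k + 1] - lam) * f_cur - beta[k] ** 2 * f_prev
            f_prev, f_cur = f_cur, f_next
        sign_prev = sign_cur
    return count


def kth_largest_eigenvalue(alpha, beta, k, tol=1e-10):
    """Bisection on the Gerschgorin interval for the k-th largest eigenvalue."""
    n = len(alpha)
    b = [0.0] + [abs(x) for x in beta] + [0.0]     # beta_0 = beta_n = 0
    lo = min(alpha[i] - b[i] - b[i + 1] for i in range(n))
    hi = max(alpha[i] + b[i] + b[i + 1] for i in range(n))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_above(alpha, beta, mid) >= k:
            lo = mid                       # at least k eigenvalues exceed mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


alpha, beta = [2.0] * 6, [1.0] * 5                 # the matrix (9.4.4)
print([kth_largest_eigenvalue(alpha, beta, k) for k in range(1, 7)])
```

Selecting a single k in this way is what makes the method attractive when only a few eigenvalues of a large matrix are wanted.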


9.5 The QR Method


At the present time, the QR method is the most efficient and widely used general 
method for the calculation of all of the eigenvalues of a matrix. The method was 
first published in 1961 by J. G. F. Francis and it has since been the subject of 
intense investigation. The QR method is quite complex in both its theory and 
application, and we are able to give only an introduction to the theory of the 
method. For actual algorithms for both symmetric and nonsymmetric matrices, 
refer to those in EISPACK and Wilkinson and Reinsch (1971). 
Given a matrix A, there is a factorization 


A=QR 


with R upper triangular and Q orthogonal. With A real, both Q and R can be 
chosen real; their construction is given in Section 9.3. We assume A is real 
throughout this section. Let $A_1 = A$, and define a sequence of matrices $A_m$, $Q_m$,
and $R_m$ by

\[ A_m = Q_m R_m, \qquad A_{m+1} = R_m Q_m, \qquad m = 1, 2, \ldots \tag{9.5.1} \]

Since $R_m = Q_m^T A_m$, we have

\[ A_{m+1} = Q_m^T A_m Q_m \tag{9.5.2} \]

The matrix $A_{m+1}$ is orthogonally similar to $A_m$, and thus by induction, to $A_1$.
The sequence $\{A_m\}$ will converge to either a triangular matrix with the
eigenvalues of A on its diagonal or to a near-triangular matrix from which the
eigenvalues can be easily calculated. In this form the convergence is usually slow,
and a technique known as shifting is used to accelerate the convergence. The
technique of shifting will be introduced and illustrated later in the section.


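Example. Let

\[ A_1 = \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 4 \end{bmatrix} \tag{9.5.3} \]

The eigenvalues are

\[ \lambda_1 = 3 + \sqrt{3} \doteq 4.7321 \qquad \lambda_2 = 3.0 \qquad \lambda_3 = 3 - \sqrt{3} \doteq 1.2679 \]

The iterates $A_m$ do not converge rapidly, and only a few are given to indicate the
qualitative behavior of the convergence:

\[ A_2 = \begin{bmatrix} 3.0000 & 1.0954 & 0 \\ 1.0954 & 3.0000 & -1.3416 \\ 0 & -1.3416 & 3.0000 \end{bmatrix} \qquad
   A_3 = \begin{bmatrix} 3.7059 & .9558 & 0 \\ .9558 & 3.5214 & .9738 \\ 0 & .9738 & 1.7727 \end{bmatrix} \]

\[ A_7 = \begin{bmatrix} 4.6792 & .2979 & 0 \\ .2979 & 3.0524 & .0274 \\ 0 & .0274 & 1.2684 \end{bmatrix} \qquad
   A_8 = \begin{bmatrix} 4.7104 & .1924 & 0 \\ .1924 & 3.0216 & -.0115 \\ 0 & -.0115 & 1.2680 \end{bmatrix} \]

\[ A_9 = \begin{bmatrix} 4.7233 & .1229 & 0 \\ .1229 & 3.0087 & .0048 \\ 0 & .0048 & 1.2680 \end{bmatrix} \qquad
   A_{10} = \begin{bmatrix} 4.7285 & .0781 & 0 \\ .0781 & 3.0035 & -.0020 \\ 0 & -.0020 & 1.2680 \end{bmatrix} \]

The elements in the (1, 2) position decrease geometrically with a ratio of about
.64 per iterate, and those in the (2, 3) position decrease with a ratio of about .42
per iterate. The value in the (3, 3) position of $A_{11}$ will be 1.2679, which is correct
to five places.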
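
The iteration (9.5.1) is easy to try on the matrix (9.5.3). The following is a minimal sketch, assuming a QR factorization computed by modified Gram-Schmidt rather than by the Householder construction of Section 9.3; the helper names are hypothetical. Gram-Schmidt makes the diagonal of R positive, so the signs of the off-diagonal elements of the iterates may differ from those printed above, while the diagonal elements and the magnitudes agree.

```python
import math


def qr_gram_schmidt(a):
    """Return (q, r) with a = q r, q orthogonal, r upper triangular (a nonsingular)."""
    n = len(a)
    cols = [[a[i][j] for i in range(n)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k, qk in enumerate(q_cols):
            # Modified Gram-Schmidt: project out q_k from the current remainder.
            r[k][j] = sum(qk[i] * v[i] for i in range(n))
            v = [v[i] - r[k][j] * qk[i] for i in range(n)]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / r[j][j] for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def qr_iterate(a, steps):
    """Apply A_{m+1} = R_m Q_m the given number of times."""
    for _ in range(steps):
        q, r = qr_gram_schmidt(a)
        a = matmul(r, q)
    return a


A1 = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
for m in (2, 3, 10):
    print("A_%d diagonal:" % m, [round(x[i], 4) for i, x in enumerate(qr_iterate(A1, m - 1))])
```

Since each step is an orthogonal similarity, the trace of every iterate stays at 9, which gives a quick consistency check on hand-copied values.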


The preliminary reduction of A to simpler form. The QR method can be
relatively expensive because the QR factorization is time-consuming when
repeated many times. To decrease the expense the matrix is prepared for the QR
method by reducing it to a simpler form, one for which the QR factorization is
much less expensive.

If A is symmetric, it is reduced to a similar symmetric tridiagonal matrix
exactly as described in Section 9.3. If A is nonsymmetric, it is reduced to a
similar Hessenberg matrix. A matrix B is Hessenberg if

\[ b_{ij} = 0 \qquad \text{for all } i > j + 1 \tag{9.5.4} \]

It is upper triangular except for a single nonzero subdiagonal. The matrix A is
reduced to Hessenberg form using the same algorithm as was used for reducing
symmetric matrices to tridiagonal form.

With A tridiagonal or Hessenberg, the Householder matrices of Section 9.3
take a simple form when calculating the QR factorization. But generally the
plane rotations (9.3.28) are used in place of the Householder matrices because
they are more efficient to compute and apply in this situation. Having produced
$A_1 = Q_1 R_1$ and $A_2 = R_1 Q_1$, we need to know that the form of $A_2$ is the same as
that of $A_1$ in order to continue using the less expensive form of QR factorization.

Suppose $A_1$ is in the Hessenberg form. From Section 9.3 the factorization
$A_1 = Q_1 R_1$ has the following value of $Q_1$:

\[ Q_1 = H_1 \cdots H_{n-1} \tag{9.5.5} \]

with each $H_k$ a Householder matrix (9.3.12):

\[ H_k = I - 2 w^{(k)} w^{(k)T}, \qquad 1 \le k \le n - 1 \tag{9.5.6} \]

Because the matrix $A_1$ is of Hessenberg form, the vectors $w^{(k)}$ can be shown
to have the special form

\[ w_i^{(k)} = 0 \qquad \text{for } i < k \text{ and } i > k + 1 \tag{9.5.7} \]

This can be shown from the equations for the components of $w^{(k)}$, and in
particular (9.3.10). From (9.5.7), the matrix $H_k$ will differ from the identity in
only the four elements in positions (k, k), (k, k + 1), (k + 1, k), and (k + 1,
k + 1). And from this it is a fairly straightforward computation to show that $Q_1$
must be Hessenberg in form. Another necessary lemma is that the product of an
upper triangular matrix and a Hessenberg matrix is again Hessenberg. Just
multiply the two forms of matrices, observing the respective patterns of zeros, in
order to prove this lemma. Combining these results, observing that $R_1$ is upper
triangular, we have that $A_2 = R_1 Q_1$ must be in Hessenberg form.

If $A_1$ is symmetric and tridiagonal, then it is trivially Hessenberg. From the
preceding result, $A_2$ must also be Hessenberg. But $A_2$ is symmetric, since

\[ A_2^T = (Q_1^T A_1 Q_1)^T = Q_1^T A_1^T Q_1 = Q_1^T A_1 Q_1 = A_2 \]

Since any symmetric Hessenberg matrix is tridiagonal, we have shown that $A_2$ is
tridiagonal. Note that the iterates in the example (9.5.3) illustrate this result.
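
The preservation of form can also be observed numerically. The following is a minimal sketch, assuming plane rotations are used for the factorization, as the text says is usual in this situation. The 4 x 4 Hessenberg matrix and all function names are hypothetical choices made for illustration.

```python
import math


def qr_givens(a):
    """Return (q, r) with a = q r, using plane rotations on adjacent rows."""
    n = len(a)
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n - 1):
        for i in range(n - 1, j, -1):
            if r[i][j] == 0.0:
                continue  # nothing to eliminate; this is what exploits the zeros
            h = math.hypot(r[i - 1][j], r[i][j])
            c, s = r[i - 1][j] / h, r[i][j] / h
            for k in range(n):
                # Rotate rows i-1 and i of r.
                x, y = r[i - 1][k], r[i][k]
                r[i - 1][k], r[i][k] = c * x + s * y, -s * x + c * y
                # Accumulate the transposed rotation into columns i-1 and i of q.
                x, y = q[k][i - 1], q[k][i]
                q[k][i - 1], q[k][i] = c * x + s * y, -s * x + c * y
            r[i][j] = 0.0
    return q, r


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_hessenberg(a, tol=1e-12):
    n = len(a)
    return all(abs(a[i][j]) <= tol for i in range(n) for j in range(n) if i > j + 1)


# Hypothetical nonsymmetric Hessenberg matrix.
H1 = [[4.0, 1.0, 2.0, 3.0],
      [2.0, 3.0, 1.0, 1.0],
      [0.0, 1.0, 2.0, 5.0],
      [0.0, 0.0, 3.0, 1.0]]
Q1, R1 = qr_givens(H1)
H2 = matmul(R1, Q1)
print(is_hessenberg(Q1), is_hessenberg(H2))
```

For a Hessenberg input only n - 1 rotations are actually performed, one per subdiagonal element, which is the source of the reduced cost per QR step.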


Convergence of the QR method. Convergence results for the QR method can be
found in Golub and Van Loan (1983, secs. 7.5 and 8.2), Parlett (1968), (1980,
chap. 8), and Wilkinson (1965, chap. 8). The following theorem is taken from the
last reference.


Theorem 9.6 Let A be a real matrix of order n, and let its eigenvalues $\{\lambda_i\}$
            satisfy

\[ |\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| > 0 \tag{9.5.8} \]

            Then the iterates $R_m$ of the QR method, defined in (9.5.1), will
            converge to an upper triangular matrix D, which contains the
            eigenvalues $\{\lambda_i\}$ in the diagonal positions. If A is symmetric, the
            sequence $\{A_m\}$ converges to a diagonal matrix. For the speed of
            convergence,

\[ \| D - A_m \| \le c \, \max_i \left| \frac{\lambda_{i+1}}{\lambda_i} \right|^m \tag{9.5.9} \]

As an example of this error bound, consider the example (9.5.3). In it, the
ratios of the successive eigenvalues are

\[ \frac{\lambda_2}{\lambda_1} \doteq .63 \qquad \frac{\lambda_3}{\lambda_2} \doteq .42 \tag{9.5.10} \]

If any one of the off-diagonal elements of $A_m$ in the example is examined, it will
be seen to decrease by one of the factors in (9.5.10).

For matrices whose eigenvalues do not satisfy (9.5.8), the iterates $A_m$ may not
converge to a triangular matrix. For A symmetric, the sequence $\{A_m\}$ will
converge to a block diagonal matrix

\[ A_m \to D = \begin{bmatrix} B_1 & & & 0 \\ & B_2 & & \\ & & \ddots & \\ 0 & & & B_r \end{bmatrix} \tag{9.5.11} \]

in which all blocks $B_i$ have order 1 or 2. Thus the eigenvalues of A can be easily
computed from those of D. If A is real and nonsymmetric, the situation is more
complicated, but acceptable. For a discussion, see Wilkinson (1965, chap. 8) and
Parlett (1968).

To see that $\{A_m\}$ does not always converge to a diagonal matrix, consider the
simple symmetric example

\[ A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \]

Its eigenvalues are $\lambda = \pm 1$. Since A is orthogonal, we have

\[ A = Q_1 R_1 \qquad \text{with} \qquad Q_1 = A \qquad R_1 = I \]

And thus

\[ A_2 = R_1 Q_1 = A \]

and all iterates $A_m = A$. The sequence $\{A_m\}$ does not converge to a diagonal
matrix.


The QR method with shift. The QR algorithm is generally applied with a shift
of origin for the eigenvalues in order to increase the speed of convergence. For a
sequence of constants $\{c_m\}$, define $A_1 = A$ and

\[ A_m - c_m I = Q_m R_m, \qquad A_{m+1} = c_m I + R_m Q_m, \qquad m = 1, 2, \ldots \tag{9.5.12} \]

The matrices $A_m$ are similar to $A_1$, since

\[ R_m = Q_m^T (A_m - c_m I) \]
\[ A_{m+1} = c_m I + Q_m^T (A_m - c_m I) Q_m = c_m I + Q_m^T A_m Q_m - c_m I \]
\[ A_{m+1} = Q_m^T A_m Q_m, \qquad m \ge 1 \tag{9.5.13} \]

The eigenvalues of $A_{m+1}$ are the same as those of $A_m$, and thence the same as
those of A.


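To be more specific on the choice of shifts $\{c_m\}$, we consider only a symmetric
tridiagonal matrix A. For $A_m$, let

\[ A_m = \begin{bmatrix}
\alpha_1^{(m)} & \beta_1^{(m)} & 0 & \cdots & 0 \\
\beta_1^{(m)} & \alpha_2^{(m)} & \beta_2^{(m)} & & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & & \beta_{n-2}^{(m)} & \alpha_{n-1}^{(m)} & \beta_{n-1}^{(m)} \\
0 & \cdots & 0 & \beta_{n-1}^{(m)} & \alpha_n^{(m)}
\end{bmatrix} \tag{9.5.14} \]

There are two methods by which $\{c_m\}$ is chosen: (1) Let $c_m = \alpha_n^{(m)}$, and (2) let
$c_m$ be the eigenvalue of

\[ \begin{bmatrix} \alpha_{n-1}^{(m)} & \beta_{n-1}^{(m)} \\ \beta_{n-1}^{(m)} & \alpha_n^{(m)} \end{bmatrix} \tag{9.5.15} \]

which is closest to $\alpha_n^{(m)}$. The second strategy is preferred, but in either case the
matrices $A_m$ converge to a block diagonal matrix in which the blocks have order
1 or 2, as in (9.5.11). It can be shown that either choice of $\{c_m\}$ ensures

\[ \beta_{n-1}^{(m)} \beta_{n-2}^{(m)} \to 0 \qquad \text{as } m \to \infty \tag{9.5.16} \]

generally at a much more rapid rate than with the original QR method (9.5.1).
From (9.5.13),

\[ \| A_{m+1} \|_2 = \| Q_m^T A_m Q_m \|_2 = \| A_m \|_2 \]

using the operator matrix norm (7.3.19) and Problem 27(c) of Chapter 7. The
matrices $\{A_m\}$ are uniformly bounded, and consequently the same is true of their
elements. From (9.5.16) and the uniform boundedness of $\{\beta_{n-1}^{(m)}\}$ and $\{\beta_{n-2}^{(m)}\}$, we
have either $\beta_{n-1}^{(m)} \to 0$ or $\beta_{n-2}^{(m)} \to 0$ as $m \to \infty$. In the former case, $\alpha_n^{(m)}$ converges
to an eigenvalue of A. And in the latter case, two eigenvalues can easily be
extracted from the limit of the submatrix (9.5.15).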

Once one or two eigenvalues have been obtained due to $\beta_{n-1}^{(m)}$ or $\beta_{n-2}^{(m)}$ being
essentially zero, the matrix $A_m$ can be reduced in order by one or two rows,
respectively. Following this, the QR method with shift can be applied to the
reduced matrix. The choice of shifts is designed to make the convergence to zero
be more rapid for $\beta_{n-1}^{(m)} \beta_{n-2}^{(m)}$ than for the remaining off-diagonal elements of the
matrix. In this way, the QR method becomes a rapid general-purpose method,
faster than any other method at the present time. For a proof of convergence of
the QR method with shift, see Wilkinson (1968). For a much more complete
discussion of the QR method, including the choice of a shift, see Parlett (1980,
chap. 8).


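Example. Use the previous example (9.5.3), and use the first method of choosing
the shift, $c_m = \alpha_n^{(m)}$. The iterates are

\[ A_1 = \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 4 \end{bmatrix} \qquad
   A_2 = \begin{bmatrix} 1.4000 & .4899 & 0 \\ .4899 & 3.2667 & .7454 \\ 0 & .7454 & 4.3333 \end{bmatrix} \]

\[ A_3 = \begin{bmatrix} 1.2915 & .2017 & 0 \\ .2017 & 3.0202 & .2724 \\ 0 & .2724 & 4.6884 \end{bmatrix} \qquad
   A_4 = \begin{bmatrix} 1.2737 & .0993 & 0 \\ .0993 & 2.9943 & .0072 \\ 0 & .0072 & 4.7320 \end{bmatrix} \]

\[ A_5 = \begin{bmatrix} 1.2694 & .0498 & 0 \\ .0498 & 2.9986 & 0 \\ 0 & 0 & 4.7321 \end{bmatrix} \]

The element $\beta_2^{(m)}$ converges to zero extremely rapidly, but the element $\beta_1^{(m)}$
converges to zero geometrically with a ratio of only about .5.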
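
The shifted iteration (9.5.12) with the first shift strategy can be sketched as follows. This is an illustration under stated assumptions, not the EISPACK algorithm: the factorization uses plane rotations (which remain well behaved when the shifted matrix is nearly singular), no deflation is performed, and the function names are hypothetical. Signs of off-diagonal elements may differ from the printed iterates because the factorization is unique only up to signs.

```python
import math


def qr_givens(a):
    """Return (q, r) with a = q r, using plane rotations on adjacent rows."""
    n = len(a)
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n - 1):
        for i in range(n - 1, j, -1):
            if r[i][j] == 0.0:
                continue
            h = math.hypot(r[i - 1][j], r[i][j])
            c, s = r[i - 1][j] / h, r[i][j] / h
            for k in range(n):
                x, y = r[i - 1][k], r[i][k]
                r[i - 1][k], r[i][k] = c * x + s * y, -s * x + c * y
                x, y = q[k][i - 1], q[k][i]
                q[k][i - 1], q[k][i] = c * x + s * y, -s * x + c * y
            r[i][j] = 0.0
    return q, r


def shifted_qr_step(a):
    """One step of (9.5.12) with the shift c_m taken as the last diagonal element."""
    n = len(a)
    c = a[n - 1][n - 1]
    shifted = [[a[i][j] - (c if i == j else 0.0) for j in range(n)] for i in range(n)]
    q, r = qr_givens(shifted)
    return [[sum(r[i][k] * q[k][j] for k in range(n)) + (c if i == j else 0.0)
             for j in range(n)] for i in range(n)]


a = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
for m in range(2, 6):
    a = shifted_qr_step(a)
    print("A_%d:" % m, [[round(x, 4) for x in row] for row in a])
```

In a production code the last row and column would be removed once the last off-diagonal element is negligible, and the iteration continued on the smaller matrix, as described in the text.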


Mention should be made of the antecedent to the QR method, motivating 
much of it. In 1958, H. Rutishauser introduced an LR method based on the 
Gaussian elimination decomposition of a matrix into a lower triangular matrix 
times an upper triangular matrix. Define 


\[ A_m = L_m R_m, \qquad A_{m+1} = R_m L_m = L_m^{-1} A_m L_m \]


with L,, lower triangular, R,, upper triangular. When applicable, this method 
will generally be more efficient than the QR method. But the nonorthogonal 
similarity transformations can cause a deterioration of the conditioning of the 
eigenvalues of some nonsymmetric matrices. And generally it is a more complicated
algorithm to implement in an automatic program. A complete discussion
is given in Wilkinson (1965, chap. 8). 


9.6 The Calculation of Eigenvectors and Inverse Iteration


The most powerful tool for the calculation of the eigenvectors of a matrix is 
inverse iteration, a method attributed to H. Wielandt in 1944. We first define and 
illustrate inverse iteration, and then comment more generally on the calculation 
of eigenvectors.
To simplify the analysis, let A be a matrix whose Jordan canonical form is
diagonal,

\[ P^{-1} A P = \mathrm{diag}[\lambda_1, \ldots, \lambda_n] \tag{9.6.1} \]

Let the columns of P be denoted by $x_1, \ldots, x_n$. Then

\[ A x_i = \lambda_i x_i, \qquad i = 1, \ldots, n \tag{9.6.2} \]

Without loss of generality, it can also be assumed that $\|x_i\|_\infty = 1$, for all i.
Let $\lambda$ be an approximation to a simple eigenvalue $\lambda_k$ of A. Given an initial
$z^{(0)}$, define $\{w^{(m)}\}$ and $\{z^{(m)}\}$ by

\[ (A - \lambda I) w^{(m+1)} = z^{(m)}, \qquad z^{(m+1)} = \frac{w^{(m+1)}}{\|w^{(m+1)}\|_\infty}, \qquad m \ge 0 \tag{9.6.3} \]

This is essentially the power method, with $(A - \lambda I)^{-1}$ replacing A in (9.2.2)-(9.2.3);
and for simplicity in analysis and implementation, we replace the scaling factor
used there by $\|w^{(m+1)}\|_\infty$. The matrix $A - \lambda I$ is ill-conditioned from the viewpoint
of the material in Section 8.4 of Chapter 8. But any resulting large perturbations
in the solution will be rich in the eigenvector $x_k$ of the eigenvalue $\lambda_k - \lambda$ for
$A - \lambda I$, and this is the vector we desire. For a further discussion of this source of
instability in solving the linear system, see the material following formula (8.4.8)
in Section 8.4. For the method (9.6.3) to work, we do not want $A - \lambda I$ to be
singular. Thus $\lambda$ should not be exactly $\lambda_k$, although it can be quite close, as a
later example demonstrates.

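For a more precise analysis, let $z^{(0)}$ be expanded in terms of the eigenvector
basis of (9.6.2):

\[ z^{(0)} = \sum_{i=1}^{n} \alpha_i x_i \tag{9.6.4} \]

And assume $\alpha_k \ne 0$. In analogy with formula (9.2.4) for the power method, we
can show

\[ z^{(m)} = \sigma_m \frac{(A - \lambda I)^{-m} z^{(0)}}{\|(A - \lambda I)^{-m} z^{(0)}\|_\infty}, \qquad |\sigma_m| = 1 \tag{9.6.5} \]

Using (9.6.4),

\[ (A - \lambda I)^{-m} z^{(0)} = \sum_{i=1}^{n} \alpha_i \left[ \frac{1}{\lambda_i - \lambda} \right]^m x_i \tag{9.6.6} \]

Let $\lambda_k - \lambda = \epsilon$, and assume

\[ |\lambda_i - \lambda| \ge c > 0, \qquad i = 1, \ldots, n, \quad i \ne k \tag{9.6.7} \]

From (9.6.6) and (9.6.5),

\[ z^{(m)} = \sigma_m \frac{ x_k + \sum_{i \ne k} \frac{\alpha_i}{\alpha_k} \left[ \frac{\epsilon}{\lambda_i - \lambda} \right]^m x_i }
{ \left\| x_k + \sum_{i \ne k} \frac{\alpha_i}{\alpha_k} \left[ \frac{\epsilon}{\lambda_i - \lambda} \right]^m x_i \right\|_\infty } \tag{9.6.8} \]

with $|\sigma_m| = 1$. If $|\epsilon| < c$, then

\[ \left\| \sum_{i \ne k} \frac{\alpha_i}{\alpha_k} \left[ \frac{\epsilon}{\lambda_i - \lambda} \right]^m x_i \right\|_\infty
\le \left| \frac{\epsilon}{c} \right|^m \sum_{i \ne k} \left| \frac{\alpha_i}{\alpha_k} \right| \tag{9.6.9} \]

This quantity goes to zero as $m \to \infty$. Combined with (9.6.8), this shows $z^{(m)}$
converges to a multiple of $x_k$ as $m \to \infty$. This convergence is linear, with the
error decreasing by a ratio of $|\epsilon / c|$ in each iterate. In practice $|\epsilon|$ is quite small and
this will usually mean $|\epsilon / c|$ is also quite small, ensuring rapid convergence.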

In implementing (9.6.3), begin by factoring $A - \lambda I$ using the LU decomposition
of Section 8.1 of Chapter 8. To simplify the notation, write

\[ A - \lambda I = LU \]

in which pivoting is not involved. In practice, pivoting would be used. Solve for
each iterate $z^{(m+1)}$ as follows:

\[ L y^{(m+1)} = z^{(m)}, \qquad U w^{(m+1)} = y^{(m+1)}, \qquad z^{(m+1)} = \frac{w^{(m+1)}}{\|w^{(m+1)}\|_\infty} \tag{9.6.10} \]

Since $A - \lambda I$ is nearly singular, the last diagonal element of U will be nearly
zero. If it is exactly zero, then change it to some small number or else change $\lambda$
very slightly and recalculate L and U.

For the initial guess $z^{(0)}$, Wilkinson (1963, p. 147) suggests using

\[ z^{(0)} = Le, \qquad e = [1, 1, \ldots, 1]^T \]

Thus in (9.6.10),

\[ y^{(1)} = e, \qquad U w^{(1)} = e \tag{9.6.11} \]

This choice is intended to ensure that $\alpha_k$ is neither zero nor small in (9.6.4).
But even if it were small, the method would usually converge rapidly. For
example, suppose that some or all of the values $\alpha_i / \alpha_k$ in (9.6.9) are about $10^4$.
And suppose $|\epsilon / c| = 10^{-5}$, a realistic value for many cases. Then the bound in
(9.6.9) becomes

\[ (10^{-5})^m \cdot n \cdot 10^4 \]


and this will decrease very rapidly as m increases.

Example. Use the earlier matrix (9.5.3).

\[ A = \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 4 \end{bmatrix} \tag{9.6.12} \]

Let $\lambda = 1.2679 \doteq \lambda_3 = 3 - \sqrt{3}$, which is accurate to five places. This leads to

\[ L = \begin{bmatrix} 1.0 & 0 & 0 \\ 1.3659 & 1.0 & 0 \\ 0 & 2.7310 & 1.0 \end{bmatrix} \qquad
   U = \begin{bmatrix} .7321 & 1.0 & 0 \\ 0 & .3662 & 1.0 \\ 0 & 0 & .0011 \end{bmatrix} \]

subject to the effects of rounding errors. Using $y^{(1)} = [1, 1, 1]^T$,

\[ w^{(1)} = [3385.2, -2477.3, 908.20]^T \]
\[ z^{(1)} = [1.0000, -.73180, .26828]^T \]
\[ w^{(2)} = [20345, -14894, 5451.9]^T \]
\[ z^{(2)} = [1.0000, -.73207, .26797]^T \tag{9.6.13} \]

and the vector $z^{(3)} = z^{(2)}$. The true answer is

\[ x_3 = [1, 1 - \sqrt{3}, 2 - \sqrt{3}]^T = [1.0000, -.73205, .26795]^T \tag{9.6.14} \]

and $z^{(2)}$ equals $x_3$ to within the limits of rounding error accumulations.
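
Inverse iteration on this example can be sketched in a few lines. The sketch below rests on several simplifying assumptions that differ from the text: it starts from $z^{(0)} = e$ rather than $z^{(0)} = Le$, it re-solves the linear system by Gaussian elimination with partial pivoting on every step instead of factoring once, and it normalizes by the signed component of largest magnitude, which absorbs the factor $\sigma_m$. The function names are hypothetical. Because it runs in double precision, the intermediate vectors will not match the printed low-precision values digit for digit.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def inverse_iteration(a, lam, z0, steps):
    """Iterate (A - lam I) w = z, then rescale w so its largest component is +1."""
    n = len(a)
    shifted = [[a[i][j] - (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    z = z0[:]
    for _ in range(steps):
        w = solve(shifted, z)
        big = max(w, key=abs)  # signed component of largest magnitude
        z = [wi / big for wi in w]
    return z


A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
z = inverse_iteration(A, 1.2679, [1.0, 1.0, 1.0], 3)
print([round(v, 5) for v in z])
print([1.0, round(1.0 - math.sqrt(3.0), 5), round(2.0 - math.sqrt(3.0), 5)])
```

The two printed vectors can be compared directly; the second is the exact eigenvector (9.6.14) rounded to five places.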


Eigenvectors for symmetric tridiagonal matrices. Let A be a real symmetric
tridiagonal matrix of order n. As previously, we assume that some or all of its
eigenvalues have been computed accurately. Inverse iteration is the preferred
method of calculation for the eigenvectors, and it is quite easy to implement. For
$\lambda$ an approximate eigenvalue of A, the calculation of the LU decomposition is
inexpensive in time and storage, even with pivoting. For example, see the
material on tridiagonal systems in Section 8.3 of Chapter 8. The previous
numerical example also illustrates the method for tridiagonal matrices.

Some error results are given as further justification for the use of inverse
iteration. Suppose that the computer arithmetic is binary floating point with
rounding and with t digits in the mantissa. In Wilkinson (1963, pp. 143-147) it is
shown that the computed solution $\hat{w}$ of

\[ (A - \lambda I) w^{(m+1)} = z^{(m)} \]

is the exact solution of

\[ (A - \lambda I + E) \hat{w} = z^{(m)} \tag{9.6.15} \]

with

\[ \|E\|_2 \le K \sqrt{n} \cdot 2^{-t} \tag{9.6.16} \]

for some constant K of order unity. This bound is of the size that would be
expected from errors of the order of the rounding error.

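If the solution $\hat{w}$ of (9.6.15) is quite large, then it will be a good approximation
to an eigenvector of A. To prove this, we begin by introducing

\[ \hat{z} = \frac{\hat{w}}{\|\hat{w}\|_2} \]

Then

\[ (A - \lambda I + E) \hat{z} = \frac{z^{(m)}}{\|\hat{w}\|_2} \]

\[ \eta \equiv (A - \lambda I) \hat{z} = -E \hat{z} + \frac{z^{(m)}}{\|\hat{w}\|_2} \]

Using $\|z^{(m)}\|_2 \le \sqrt{n} \, \|z^{(m)}\|_\infty \le \sqrt{n}$, the residual $\eta$ satisfies

\[ \|\eta\|_2 \le \|E\|_2 + \frac{\sqrt{n}}{\|\hat{w}\|_2} \le K \sqrt{n} \cdot 2^{-t} + \frac{\sqrt{n}}{\|\hat{w}\|_2} \tag{9.6.17} \]

which is small if $\|\hat{w}\|_2$ is large. To prove that this implies $\hat{z}$ is close to an
eigenvector of A, we let $\{x_i \mid i = 1, \ldots, n\}$ be an orthonormal set of eigenvectors.
And assume

\[ \lambda \doteq \lambda_k, \qquad \hat{z} = \sum_{i=1}^{n} \alpha_i x_i, \qquad \|\hat{z}\|_2^2 = \sum_{i=1}^{n} \alpha_i^2 = 1 \]

with $\lambda_k$ an isolated eigenvalue of A. Also, suppose

\[ |\lambda_i - \lambda| \ge c > 0, \qquad \text{all } i \ne k \]

with

\[ c \gg |\lambda_k - \lambda| \]

With these assumptions, we can now derive a bound for the error in $\hat{z}$.
Expanding $\eta$ using the eigenvector basis:

\[ \eta = A \hat{z} - \lambda \hat{z} = \sum_{i=1}^{n} \alpha_i (\lambda_i - \lambda) x_i \]

\[ \|\eta\|_2^2 = \sum_{i=1}^{n} \alpha_i^2 (\lambda_i - \lambda)^2 \ge \sum_{i \ne k} \alpha_i^2 (\lambda_i - \lambda)^2 \ge c^2 \sum_{i \ne k} \alpha_i^2 \]

\[ \sum_{i \ne k} \alpha_i^2 \le \frac{1}{c^2} \|\eta\|_2^2 \]

which is quite small using (9.6.17). From $\|\hat{z}\|_2 = 1$, this implies $|\alpha_k| \doteq 1$ and

\[ \|\hat{z} - \alpha_k x_k\|_2 = \sqrt{\sum_{i \ne k} \alpha_i^2} \le \frac{1}{c} \|\eta\|_2 \tag{9.6.18} \]

showing the desired result. For a further discussion of the error, see Wilkinson
(1963, pp. 142-146) and (1965, pp. 321-330).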

Another method for calculating eigenvectors would appear to be the direct
solution of

\[ (A - \lambda I) x = 0 \]

after deleting one equation and setting one of the unknown components to a
nonzero constant, for example $x_1 = 1$. This is often the procedure used in
undergraduate linear algebra courses. But as a general numerical method, it can
be disastrous. A complete discussion of this problem is given in Wilkinson (1965,
pp. 315-321), including an excellent example. We just use the previous example
to show that the results need not be as good as those obtained with inverse
iteration.


Example. Consider the preceding example (9.6.12) with $\lambda = 1.2679$. We consider
$(A - \lambda I) x = 0$ and delete the last equation to obtain

\[ .7321 x_1 + x_2 = 0 \]
\[ x_1 + 1.7321 x_2 + x_3 = 0 \]

Taking $x_1 = 1.0$, we have the approximate eigenvector

\[ x = [1.0000, -.73210, .26807]^T \]

Compared with the true answer (9.6.14), this is a slightly poorer result than
(9.6.13) obtained by inverse iteration. In general, the results of using this
approach can be very poor, and great care must be taken when using it.


The inverse iteration method requires a great deal of care in its implementation.
For dealing with a particular matrix, any difficulties can be dealt with on an
ad hoc basis. But for a general computer program we have to deal with 
eigenvalues that are multiple or close together, which can cause some difficulty if 
not dealt with carefully. For nonsymmetric matrices whose Jordan canonical 
form is not diagonal, there are additional difficulties in selecting a correct basis of 
eigenvectors. The best reference for this topic is Wilkinson (1965). Also see 
Golub and Van Loan (1983, pp. 238-240) and Parlett (1980, pp. 62-69). For 
several excellent programs, see Wilkinson and Reinsch (1971, pp. 418-439) and 
Garbow et al. (1977). 


9.7 Least Squares Solution of Linear Systems 


We now consider the solution of overdetermined systems of linear equations

\[ \sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, \ldots, m \tag{9.7.1} \]

with m > n. These systems arise in a variety of applications, with the best known
being the fitting of functions to a set of data $\{(t_i, b_i) \mid i = 1, \ldots, m\}$, about which
we say more later. It might appear that the logical place for considering such
systems would be in Chapter 8, but some of the tools used in the solution of
(9.7.1) involve the orthogonal transformations introduced in this chapter. The
numerical solution of (9.7.1) can be quite involved, both theoretically and
practically, and we give only some major highlights of the subject.

An overdetermined system (9.7.1) will generally not have a solution. For that
reason, we seek a vector $x = (x_1, \ldots, x_n)$ that solves (9.7.1) approximately in
some sense. Introduce

\[ A = [a_{ij}], \qquad x = [x_1, \ldots, x_n]^T, \qquad b = [b_1, \ldots, b_m]^T \]

with A of order m x n. Then (9.7.1) can be written as

\[ Ax = b \tag{9.7.2} \]

For simplicity, assume A and b are real. Among the possible ways of finding an
approximate solution, we can seek a vector x that minimizes

\[ \|Ax - b\|_p \tag{9.7.3} \]

for some p, $1 \le p \le \infty$. In this section, only the classical case of p = 2 is
considered, although in recent years, much work has also been done for the cases
p = 1 and $p = \infty$.
The solution x* of

\[ \mathop{\mathrm{Minimize}}_{x \in R^n} \|Ax - b\|_2 \tag{9.7.4} \]

is called the least squares solution of the linear system Ax = b. There are a
number of reasons for this approach to solving Ax = b. First, it is easier to
develop the theory and the practical solution techniques for minimizing
$\|Ax - b\|_2$, partly because its square is a continuously differentiable function of
$x_1, \ldots, x_n$. Second, the curve fitting problems that lead to systems (9.7.1) often
have a statistical framework that leads to (9.7.4), in preference to minimizing
$\|Ax - b\|_p$ with some $p \ne 2$.

To better understand the nature of the solution of (9.7.4), we give the
following theoretical construction. It also can be used as a practical numerical
approach, although there are usually other more efficient constructions. Crucial
to the theory is the singular value decomposition (SVD)

\[ V^T A U = F = \begin{bmatrix} \mathrm{diag}(\mu_1, \ldots, \mu_r, 0, \ldots, 0) \\ 0 \end{bmatrix} \tag{9.7.5} \]

in which F has order m x n and its diagonal block has order n. The matrices U and
V are orthogonal, and the singular values $\mu_i$ satisfy

\[ \mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > 0 \]

See Theorem 7.5 in Chapter 7 for more information; and later in this section, we
describe a way to construct the SVD of A.


Theorem 9.7 Let A be real and m x n, $m \ge n$. Define $z = U^T x$, $c = V^T b$.
            Then the solution x* = Uz* of (9.7.4) is given by

\[ z_i^* = \frac{c_i}{\mu_i}, \qquad i = 1, \ldots, r \tag{9.7.6} \]

            with $z_{r+1}, \ldots, z_n$ arbitrary. When r = n, x* is unique. When
            r < n, the solution of (9.7.4) of minimal Euclidean norm is
            obtained by setting

\[ z_i^* = 0, \qquad i = r + 1, \ldots, n \tag{9.7.7} \]

            [This is also called the least squares solution of (9.7.4), even
            though it is not the unique minimizer of $\|Ax - b\|_2$.] The
            minimum in (9.7.4) is given by

\[ \|Ax^* - b\|_2 = \left[ \sum_{j=r+1}^{m} c_j^2 \right]^{1/2} \tag{9.7.8} \]

Proof       Recall Problem 13(a) of Chapter 7. For any $x \in R^m$ and any orthogonal
            matrix P of order m,

\[ \|Px\|_2 = \|x\|_2 \]

            Applying this to $\|Ax - b\|_2$ and using (9.7.5),

\[ \|Ax - b\|_2 = \|V^T A x - V^T b\|_2 = \|V^T A U U^T x - c\|_2 = \|Fz - c\|_2 \]

\[ \|Fz - c\|_2 = \left[ \sum_{j=1}^{r} (\mu_j z_j - c_j)^2 + \sum_{j=r+1}^{m} c_j^2 \right]^{1/2} \tag{9.7.9} \]

            Then (9.7.6) and (9.7.8) follow immediately. For (9.7.7), use

\[ \|x^*\|_2 = \|z^*\|_2 = \left[ \sum_{j=1}^{r} \left( \frac{c_j}{\mu_j} \right)^2 + \sum_{j=r+1}^{n} z_j^2 \right]^{1/2} \]

            with $z_{r+1}, \ldots, z_n$ arbitrary according to (9.7.9). Choosing (9.7.7) leads to
            a unique minimum value for $\|x^*\|_2$.

Define the n x m matrix

\[ F^+ = \begin{bmatrix} \mathrm{diag}(\mu_1^{-1}, \ldots, \mu_r^{-1}, 0, \ldots, 0) & 0 \end{bmatrix} \tag{9.7.10} \]

in which the diagonal block has order n, and

\[ A^+ = U F^+ V^T \tag{9.7.11} \]

Looking at (9.7.6)-(9.7.8),

\[ x^* = U z^* = U F^+ c = U F^+ V^T b \]
\[ x^* = A^+ b \tag{9.7.12} \]

The matrix $A^+$ is called the generalized inverse of A, and it yields the least
squares solution of Ax = b. The formula (9.7.12) shows that x* depends linearly
on b. This representation of x* is an important tool in studying the numerical
solution of Ax = b. Some further properties of $A^+$ are left to Problems 27 and
28.
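
As a small worked illustration of Theorem 9.7 and of (9.7.10)-(9.7.12), consider a hypothetical rank-deficient case chosen so that U and V are identity matrices. The matrix and right side below are assumptions made for illustration and do not come from the text.

```latex
% Hypothetical data: m = 3, n = 2, r = 1, U = I_2, V = I_3, so F = A and c = b.
A = F = \begin{bmatrix} 2 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}, \qquad
b = c = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}, \qquad \mu_1 = 2

% (9.7.6): the first component is determined, the second is arbitrary.
z_1^* = \frac{c_1}{\mu_1} = 1, \qquad z_2 \text{ arbitrary}

% (9.7.7): the minimal norm choice sets z_2^* = 0.
x^* = U z^* = \begin{bmatrix} 1 \\ 0 \end{bmatrix}

% (9.7.8): the minimum residual uses c_2 and c_3.
\|Ax^* - b\|_2 = \left[ c_2^2 + c_3^2 \right]^{1/2} = \sqrt{10}

% (9.7.10)-(9.7.12): the generalized inverse reproduces x^*.
A^+ = F^+ = \begin{bmatrix} \tfrac{1}{2} & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad
A^+ b = \begin{bmatrix} 1 \\ 0 \end{bmatrix} = x^*
```

Any other value of $z_2$ gives the same residual $\sqrt{10}$ but a larger $\|x\|_2$, which is the sense in which (9.7.7) selects a unique solution.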

To simplify the remaining development of methods for finding x* and
analyzing its stability, we restrict A to having full rank, r = n. This is the most
important case for applications. For the singular values of A,

\[ \mu_1 \ge \mu_2 \ge \cdots \ge \mu_n > 0 \tag{9.7.13} \]

The concept of matrix norm can be generalized to A, from that given for
square matrices in Section 7.3. Define

\[ \|A\| = \mathop{\mathrm{Supremum}}_{x \in R^n, \; x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} \tag{9.7.14} \]

It can be shown, using the SVD of A, that

\[ \|A\| = \sqrt{r_\sigma(A^T A)} = \mu_1 \tag{9.7.15} \]

In analogy with the error analysis in Section 8.4, define a condition number for
Ax = b by

\[ \mathrm{cond}(A)_2 = \|A\| \, \|A^+\| = \frac{\mu_1}{\mu_n} \tag{9.7.16} \]

Using this notation, we give a stability result from Golub and Van Loan (1983,
p. 141). It is the analogue of Theorem 8.4, for the perturbation analysis of square
nonsingular linear systems.
Let $b + \delta b$ and $A + \delta A$ be perturbations of b and A, respectively. Define

\[ x^* = A^+ b, \qquad \hat{x}^* = (A + \delta A)^+ (b + \delta b) \]
\[ r = b - A x^*, \qquad \hat{r} = (b + \delta b) - (A + \delta A) \hat{x}^* \tag{9.7.17} \]

Assume

\[ \epsilon = \mathrm{Max} \left\{ \frac{\|\delta A\|}{\|A\|}, \frac{\|\delta b\|_2}{\|b\|_2} \right\} < \frac{1}{\mathrm{cond}(A)_2} \tag{9.7.18} \]

and

\[ \sin \theta = \frac{\|r\|_2}{\|b\|_2} < 1 \tag{9.7.19} \]

implicitly defining $\theta$, $0 \le \theta < \pi/2$. Then

\[ \frac{\|\hat{x}^* - x^*\|_2}{\|x^*\|_2} \le \epsilon \left\{ \frac{2 \, \mathrm{cond}(A)_2}{\cos \theta} + \tan \theta \, [\mathrm{cond}(A)_2]^2 \right\} + O(\epsilon^2) \tag{9.7.20} \]

\[ \frac{\|\hat{r} - r\|_2}{\|b\|_2} \le \epsilon \, [1 + 2 \, \mathrm{cond}(A)_2] \, \mathrm{Min}\{1, m - n\} + O(\epsilon^2) \tag{9.7.21} \]

For the case m = n with rank(A) = n, the residual r will be zero, and then
(9.7.20) will reduce to the earlier Theorem 8.4.

The preceding results say that the change in r can be quite small, while the
change in x* can be quite large. Note that the bound in (9.7.20) depends on the
square of $\mathrm{cond}(A)_2$, as compared to the dependence on cond(A) for
the nonsingular case with m = n [see (8.4.18)]. If the columns of A are nearly
dependent, then $\mathrm{cond}(A)_2$ can be very large, resulting in a larger bound in
(9.7.20) than in (9.7.21) [see Problem 34(a)]. Whether this is acceptable or not will
depend on the problem, on whether one wants small values of r or accurate
values of x*.


The least squares data-fitting problem. The origin of most overdetermined linear
systems is that of fitting data by a function from a prescribed family of functions.
Let $\{(t_i, b_i) \mid i = 1, \ldots, m\}$ be a given set of data, presumably representing some
function b = g(t). Let $\varphi_1(t), \ldots, \varphi_n(t)$ be given functions, and let $\mathcal{F}$ be the
family of all linear combinations of $\varphi_1, \ldots, \varphi_n$:

\[ \mathcal{F} = \left\{ \sum_{j=1}^{n} x_j \varphi_j(t) \;\Big|\; x_j \in R \right\} \tag{9.7.22} \]

We want to choose an element of $\mathcal{F}$ to approximately fit the given data:

\[ \sum_{j=1}^{n} x_j \varphi_j(t_i) \doteq b_i, \qquad i = 1, \ldots, m \tag{9.7.23} \]

This is the system (9.7.1), with $a_{ij} = \varphi_j(t_i)$.
For statistical modeling reasons, we seek to minimize

\[ E(x) = \left[ \frac{1}{m} \sum_{i=1}^{m} \left[ b_i - \sum_{j=1}^{n} x_j \varphi_j(t_i) \right]^2 \right]^{1/2} \tag{9.7.24} \]

hence the description of fitting data in the sense of least squares. The quantity
E(x*), for which E(x) is a minimum, is called the root-mean-square error in the
approximation of the data by the function

\[ g^*(t) = \sum_{j=1}^{n} x_j^* \varphi_j(t) \tag{9.7.25} \]

Using earlier notation,

\[ E(x) = \frac{1}{\sqrt{m}} \|b - Ax\|_2 \]

and minimizing E(x) is equivalent to finding the least squares solution of
(9.7.23).
Forming the partial derivatives of the square of (9.7.24) with respect to each $x_j$,
and setting these equal to zero, we obtain the system of equations

\[ A^T A x = A^T b \tag{9.7.26} \]

This system is a necessary condition for any minimizer of E(x), and it can also
be shown to be sufficient. The system (9.7.26) is called the normal equation for the
least squares problem. If A has rank n, then $A^T A$ will be n x n and nonsingular,
and (9.7.26) has a unique solution.

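To establish the equivalency of (9.7.26) with the earlier solution of the least
squares problem, we use the SVD of A to convert (9.7.26) to a simpler form.
Substituting $A = V F U^T$ into (9.7.26),

\[ U F^T F U^T x = U F^T V^T b \]

Multiply by $U^T$, and use the earlier notation $z = U^T x$, $c = V^T b$. Then

\[ F^T F z = F^T c \]

Since $F^T F$ is the diagonal matrix with entries $\mu_1^2, \ldots, \mu_r^2, 0, \ldots, 0$, the jth
equation reads $\mu_j^2 z_j = \mu_j c_j$ for $j \le r$, which is (9.7.6). This gives a complete
mathematical equivalence of the normal equation to the earlier minimization of
$\|Ax - b\|_2$ given in Theorem 9.7.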

Assuming that rank(A) = n, the solution x* can be found by solving the
normal equation. Since $A^T A$ is symmetric and positive definite, the Cholesky
decomposition can be used for the solution [see (8.3.8)-(8.3.17)]. The effect on x*
of rounding errors will be proportional to both the unit round of the computer
and to the condition number of $A^T A$. From the SVD of A, this is easily seen to be

\[ \mathrm{cond}(A^T A)_2 = \frac{\mu_1^2}{\mu_n^2} = [\mathrm{cond}(A)_2]^2 \tag{9.7.27} \]

Thus the sensitivity of x* to errors will be proportional to $[\mathrm{cond}(A)_2]^2$, which is
consistent with the earlier perturbation error bound (9.7.20).

The result (9.7.27) used to be cited as the main reason for avoiding the use of
the normal equation for solving the least squares problem. This is still good
advice, but the reasons are more subtle. From (9.7.19), if $\|r\|_2$ is nearly zero, then
$\sin \theta \doteq 0$, and the bound (9.7.20) will be proportional to $\mathrm{cond}(A)_2$. In contrast,
the error bound for Cholesky's method will feature $[\mathrm{cond}(A)_2]^2$, which is larger
when $\mathrm{cond}(A)_2$ is large. A second reason occurs when A has columns that are
nearly dependent. The use of finite computer arithmetic can then lead to an
approximate normal equation that has lost vital information present in A. In such
a case, $A^T A$ will be nearly singular, and solution of the normal equation will yield
much less accuracy in x* than will some other methods that work directly with
Ax = b. For a more extensive discussion of this, see Lawson and Hanson (1974,
pp. 126-129).


Example. Consider the data in Table 9.3 and its plot in Figure 9.2. We use a
cubic polynomial to fit these data, and thus are led to minimizing the expression

\[ E(x) = \left[ \frac{1}{m} \sum_{i=1}^{m} \left[ b_i - \sum_{j=1}^{4} x_j t_i^{j-1} \right]^2 \right]^{1/2} \]

This yields the overdetermined linear system

\[ \sum_{j=1}^{4} x_j t_i^{j-1} = b_i, \qquad i = 1, \ldots, m \tag{9.7.28} \]

Table 9.3 Data for a cubic least squares fit

Figure 9.2. Plot of data of Table 9.3.

and the normal equations

\[ \sum_{j=1}^{4} x_j \left[ \sum_{i=1}^{m} t_i^{j+k-2} \right] = \sum_{i=1}^{m} b_i t_i^{k-1}, \qquad k = 1, 2, 3, 4 \]

Writing this in the form (9.7.26),

\[ A^T A = \begin{bmatrix}
21 & 10.5 & 7.175 & 5.5125 \\
10.5 & 7.175 & 5.5125 & 4.51666 \\
7.175 & 5.5125 & 4.51666 & 3.85416 \\
5.5125 & 4.51666 & 3.85416 & 3.38212
\end{bmatrix} \]

\[ A^T b = [24.1180, 13.2345, 9.468365, 7.5594405]^T \tag{9.7.29} \]

The solution is

\[ x^* = [.5747, 4.7259, -11.1282, 7.6687]^T \tag{9.7.30} \]

This solution is very sensitive to changes in b. This can be inferred from the
condition number

\[ \mathrm{cond}(A^T A) = 12105 \tag{9.7.31} \]

As a further indication, perturb the right-hand vector $A^T b$ by adding to it the
vector

\[ [.01, -.01, .01, -.01]^T \]

This is consistent with the size of the errors present in the data values $b_i$. With
this new right side, the normal equation has the perturbed solution

\[ \hat{x}^* = [.7408, 2.6825, -6.1538, 4.4550]^T \]

which differs significantly from x*.

Figure 9.3. The least squares fit g*(t).

The plot of the least squares fit

\[ g^*(t) = x_1^* + x_2^* t + x_3^* t^2 + x_4^* t^3 \]

is shown in Figure 9.3, together with the data. Its root-mean-square error is

\[ E(x^*) = .0421 \]
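
The normal equation approach and its sensitivity can be sketched as follows. The nodes $t_i = (i - 1)/20$, $i = 1, \ldots, 21$, are an assumption inferred from the entries of $A^T A$ in (9.7.29). The data values of Table 9.3 are not reproduced here, so the right side b is hypothetical (samples of an arbitrary cubic), and the function names are also hypothetical. Because the normal equation is linear, the change in the solution caused by the perturbation of $A^T b$ does not depend on b.

```python
import math


def normal_equations(a, b):
    """Form A^T A and A^T b for an m x n matrix a and right side b."""
    m, n = len(a), len(a[0])
    ata = [[sum(a[i][j] * a[i][k] for i in range(m)) for k in range(n)]
           for j in range(n)]
    atb = [sum(a[i][j] * b[i] for i in range(m)) for j in range(n)]
    return ata, atb


def cholesky_solve(s, rhs):
    """Solve s x = rhs for symmetric positive definite s via s = L L^T."""
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    y = [0.0] * n
    for i in range(n):  # forward substitution, L y = rhs
        y[i] = (rhs[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution, L^T x = y
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


t = [i / 20.0 for i in range(21)]                   # assumed nodes
A = [[ti ** j for j in range(4)] for ti in t]       # columns 1, t, t^2, t^3
coeffs = [1.0, -2.0, 0.5, 3.0]                      # hypothetical cubic
b = [sum(c * ti ** j for j, c in enumerate(coeffs)) for ti in t]

ata, atb = normal_equations(A, b)
x = cholesky_solve(ata, atb)
delta = [0.01, -0.01, 0.01, -0.01]
x_pert = cholesky_solve(ata, [u + d for u, d in zip(atb, delta)])
print([round(v, 4) for v in x])
print([round(p - v, 4) for p, v in zip(x_pert, x)])
```

The second printed vector is the shift in the coefficients caused by the perturbation; it can be compared with the difference between the perturbed and unperturbed solutions quoted in the example.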


The matrix A for the preceding example has columns that are almost linearly
dependent, and $A^T A$ has a large condition number. To improve on this, we can
choose a better set of basis functions $\{\varphi_j(t)\}$ for the family $\mathcal{F}$ of polynomials of
degree $\le 3$. Examining the coefficients of $A^T A$,

\[ [A^T A]_{jk} = \sum_{i=1}^{m} \varphi_j(t_i) \varphi_k(t_i), \qquad 1 \le j, k \le n \tag{9.7.32} \]

If the points $\{t_i\}$ are well distributed throughout the interval [a, b], then the
previous sum, when multiplied by (b - a)/m, is an approximation to

\[ \int_a^b \varphi_j(t) \varphi_k(t) \, dt \]

To obtain a matrix $A^T A$ that has a smaller condition number, choose functions
$\varphi_j(t)$ that are orthonormal. Then $A^T A$ will approximate a multiple of the
identity, and A will have approximately orthogonal columns of nearly equal length,
leading to condition numbers close to 1. In fact, all that is really important is that
the family $\{\varphi_j(t)\}$ be orthogonal, since then the matrix $A^T A$ will be nearly diagonal,
a well-conditioned matrix.


Example. We repeat the preceding example, using the Legendre polynomials
that are orthonormal over [0, 1]. The first four orthonormal Legendre polynomials
on [0, 1] are

\[ \varphi_0(t) = 1 \qquad \varphi_1(t) = \sqrt{3} \, s \qquad \varphi_2(t) = \frac{\sqrt{5}}{2} (3 s^2 - 1) \qquad \varphi_3(t) = \frac{\sqrt{7}}{2} (5 s^3 - 3 s) \tag{9.7.33} \]

with s = 2t - 1, $0 \le t \le 1$. For the normal equation (9.7.26),

\[ A^T A = \begin{bmatrix}
21.0000 & 0 & 2.3479 & 0 \\
0 & 23.1000 & 0 & 5.1164 \\
2.3479 & 0 & 25.4993 & 0 \\
0 & 5.1164 & 0 & 28.3889
\end{bmatrix} \]

\[ A^T b = [24.1180, 4.0721, 3.4015, 4.8519]^T \]
\[ x^* = [1.1454, .1442, .0279, .1449]^T \tag{9.7.34} \]

The condition number of $A^T A$ is now

\[ \mathrm{cond}(A^T A) = 1.58 \tag{9.7.35} \]

much less than earlier in (9.7.31).
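
The effect of the choice of basis on the conditioning of $A^T A$ can be checked directly. The following is a minimal sketch, assuming the nodes $t_i = (i - 1)/20$, $i = 1, \ldots, 21$ (inferred from the entries of $A^T A$), and using a cyclic Jacobi eigenvalue iteration, which is a method swapped in here for convenience and is not one discussed in this section. The function names are hypothetical.

```python
import math


def gram_matrix(basis, points):
    """Return A^T A where a_ij = basis[j](points[i])."""
    return [[sum(f(t) * g(t) for t in points) for g in basis] for f in basis]


def jacobi_eigenvalues(s, sweeps=30):
    """Eigenvalues of a small symmetric matrix by cyclic Jacobi rotations."""
    n = len(s)
    a = [row[:] for row in s]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, sn = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    x, y = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * x - sn * y, sn * x + c * y
                for k in range(n):  # rotate rows p and q
                    x, y = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * x - sn * y, sn * x + c * y
    return sorted(a[i][i] for i in range(n))


def condition_number(s):
    ev = jacobi_eigenvalues(s)
    return ev[-1] / ev[0]


points = [i / 20.0 for i in range(21)]  # assumed nodes
monomials = [lambda t, j=j: t ** j for j in range(4)]
legendre = [
    lambda t: 1.0,
    lambda t: math.sqrt(3.0) * (2 * t - 1),
    lambda t: math.sqrt(5.0) / 2 * (3 * (2 * t - 1) ** 2 - 1),
    lambda t: math.sqrt(7.0) / 2 * (5 * (2 * t - 1) ** 3 - 3 * (2 * t - 1)),
]
print(round(condition_number(gram_matrix(monomials, points))))
print(round(condition_number(gram_matrix(legendre, points)), 2))
```

The two printed values can be compared with (9.7.31) and (9.7.35).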


The QR method of solution Recall the QR factorization of Section 9.3, 
following (9.3.11). As there, we consider Householder matrices of order m X m 


$P_j = I - 2w^{(j)}w^{(j)T} \qquad j = 1, \ldots, n$


to reduce to zero the elements below the diagonal in A. The orthogonal matrices
$P_j$ are applied in succession, to reduce to zero the elements below the diagonal in
columns 1 through n of A. The vector $w^{(j)}$ has nonzero elements in positions j
through m. This process leads to a matrix


$R = P_n \cdots P_1 A = Q^TA$ (9.7.36)


If these are also applied to the right side in the system Ax = b, we obtain the 
equivalent system 


$Rx = Q^Tb$ (9.7.37)
The matrix R has the form
$R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$ (9.7.38)


with $R_1$ an upper triangular square matrix of order n × n. The matrix $R_1$ must
be nonsingular, since A and $R = Q^TA$ have the same rank, namely n. In line with
(9.7.38), write
$Q^Tb = \begin{bmatrix} g_1 \\ g_2 \end{bmatrix} \qquad g_1 \in R^n, \quad g_2 \in R^{m-n}$


Then
$\|Ax - b\|_2 = \|Q^TAx - Q^Tb\|_2 = \|Rx - Q^Tb\|_2 = \left[\|R_1x - g_1\|_2^2 + \|g_2\|_2^2\right]^{1/2}$

The least squares solution of Ax = b is obtained by solving the nonsingular
upper triangular system


$R_1x = g_1$ (9.7.39)
Then the minimum is


$\|Ax^* - b\|_2 = \|g_2\|_2$ (9.7.40)
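
The steps (9.7.36) to (9.7.40) can be sketched in a few lines of Python. This is a minimal illustration for a matrix of full column rank, not a substitute for a library routine; the function name and the small data set are hypothetical and are not taken from the text.

```python
import math

def householder_least_squares(A, b):
    """Minimize ||Ax - b||_2 for an m x n matrix A (m >= n) of full column rank.

    Returns the solution x of R_1 x = g_1 and the minimum residual ||g_2||_2.
    """
    m, n = len(A), len(A[0])
    R = [list(map(float, row)) for row in A]   # overwritten by R = Q^T A
    g = [float(v) for v in b]                  # overwritten by Q^T b
    for j in range(n):
        # Householder vector v, nonzero only in positions j..m-1.
        norm = math.sqrt(sum(R[i][j] ** 2 for i in range(j, m)))
        if norm == 0.0:
            raise ValueError("A does not have full column rank")
        alpha = -math.copysign(norm, R[j][j])  # sign chosen to avoid cancellation
        v = [0.0] * m
        v[j] = R[j][j] - alpha
        for i in range(j + 1, m):
            v[i] = R[i][j]
        vtv = sum(vi * vi for vi in v)
        # Apply P = I - 2 v v^T / (v^T v) to the remaining columns and to g.
        for k in range(j, n):
            s = 2.0 * sum(v[i] * R[i][k] for i in range(j, m)) / vtv
            for i in range(j, m):
                R[i][k] -= s * v[i]
        s = 2.0 * sum(v[i] * g[i] for i in range(j, m)) / vtv
        for i in range(j, m):
            g[i] -= s * v[i]
    # Back substitution for the upper triangular system R_1 x = g_1.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = g[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))
        x[i] = acc / R[i][i]
    residual = math.sqrt(sum(g[i] ** 2 for i in range(n, m)))
    return x, residual

# Illustrative data: fit b ~ x1 + x2*t at t = 0, 1, 2.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 2.0]
x, residual = householder_least_squares(A, b)
print(x, residual)
```

For this small example the normal equation gives $x^* = [7/6, 1/2]^T$ and a minimum residual of $1/\sqrt{6}$, which the sketch should match to rounding error.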


The QR method for calculating x* is slightly more expensive in operations
than the Cholesky method. The Cholesky method, including the formation of
$A^TA$, has an operation count (multiplications and divisions) of about


$\frac{1}{2}mn^2 + \frac{n^3}{6}$


and the Householder QR method has an operation count of about


$mn^2 + \frac{n^3}{3}$


Nonetheless, the QR method is generally the recommended method for calculating
the least squares solution. It works directly on the matrix A, and because of
that and the use of orthogonal transformations, rounding errors have less
effect than when the Cholesky factorization is used to solve the normal
equation. For a thorough discussion, see Golub and Van Loan (1983, pp.
147-149) and Lawson and Hanson (1974, chap. 16).


Example We consider the earlier example of the linear system (9.7.28) with

$A = [t_i^{j-1}] \qquad 1 \le i \le 21, \quad 1 \le j \le 4$
for the data in Table 9.3. Then cond(A) = 110.01,


$$R_1 = \begin{bmatrix} -4.5826 & -2.2913 & -1.5657 & -1.2029 \\ 0 & 1.3874 & 1.3874 & 1.2688 \\ 0 & 0 & -.3744 & -.5617 \\ 0 & 0 & 0 & -.0987 \end{bmatrix}$$


$g_1 = [-5.2630, .8472, -.1403, -.7566]^T$


The solution x* is the same as in (9.7.30), as is the root-mean-square error. 


The singular value decomposition The SVD is a very valuable tool for analyzing
and solving least squares problems and other problems of linear algebra. For
least squares problems of less than full rank, the QR method just described will
probably lead to a triangular matrix $R_1$ that is nonsingular, but has some very
small diagonal elements. The SVD of A can then be quite useful in making
clearer the structure of A. If some singular values $\mu_i$ are nearly zero, then the
effect of setting them to zero can be determined more easily than with some other
methods for solving for x*. Thus there is ample justification for finding efficient
ways to calculate the SVD of A.

One of the best known ways to calculate the SVD of A is due to G. Golub, C.
Reinsch, and W. Kahan, and a complete discussion of it is given in Golub and 
Van Loan (1983, sec. 6.5). We instead merely show how the singular value 
decomposition in (9.7.5) can be obtained from the solution of a symmetric matrix 
eigenvalue problem together with a QR factorization. 

From A real and m × n, m ≥ n, we have that $A^TA$ is n × n and real. In
addition, it is straightforward to show that $A^TA$ is symmetric and positive
semidefinite [$x^TA^TAx \ge 0$ for all x]. Using a program to solve the symmetric
eigenvalue problem, find a diagonal matrix D and an orthogonal matrix U for
which


$U^T(A^TA)U = D$ (9.7.41)


Let $D = \mathrm{diag}[\lambda_1, \ldots, \lambda_n]$ with the eigenvalues arranged in descending order. If
any $\lambda_i$ is a small negative number, then set it to zero, since all eigenvalues of $A^TA$
should be nonnegative except for possible perturbations due to rounding errors.
From (9.7.41), define B = AU, of order m × n. Then (9.7.41) implies


$B^TB = D$


Then the columns of B are orthogonal. Moreover, if some $\lambda_i = 0$, then the
corresponding column of B must be identically zero, because its norm is zero.
Using the QR method, calculate an orthogonal matrix V for which


$V^TB = R$ (9.7.42)
is zero below the diagonal in all columns. The matrix R satisfies
$R^TR = B^TVV^TB = B^TB = D$


Again, the columns of R must be orthogonal, and if some $\lambda_i = 0$, then the
corresponding column of R must be zero. Since R is upper triangular, we can use
the orthogonality to show that the columns of R must be zero in all positions
above the diagonal. Thus R has the form of the matrix F of (9.7.5). We will then have
R = F with $\mu_i = \sqrt{\lambda_i}$. Substituting B = AU in (9.7.42), we have the desired SVD:


$V^TAU = R$


One of the possible disadvantages of this procedure is that $A^TA$ must be formed,
and this may lead to a loss of information due to the use of finite-length
computer arithmetic. But the method is simple to implement, if the symmetric
eigenvalue problem is solvable.
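
The procedure just described can be sketched with NumPy, using `numpy.linalg.eigh` for the symmetric eigenvalue problem (9.7.41) and `numpy.linalg.qr` for the factorization (9.7.42). This is only an illustration under stated assumptions: the function name and test matrix are hypothetical, and the sign adjustment that makes the diagonal of R nonnegative is an added convenience rather than a step given in the text.

```python
import numpy as np

def svd_via_normal_matrix(A):
    """Return V, R, U with V^T A U = R, following (9.7.41)-(9.7.42).

    Suitable only as a sketch: forming A^T A squares the condition number.
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    lam, U = np.linalg.eigh(A.T @ A)          # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]             # rearrange in descending order
    lam = np.clip(lam[order], 0.0, None)      # small negative values set to zero
    U = U[:, order]
    B = A @ U                                 # columns of B are orthogonal
    V, R = np.linalg.qr(B, mode="complete")   # V is m x m, R is m x n
    # Added convenience: flip signs so that the diagonal of R is nonnegative.
    signs = np.where(np.diag(R[:n, :n]) < 0, -1.0, 1.0)
    V[:, :n] *= signs
    R[:n, :] *= signs[:, None]
    return V, R, U

# Illustrative 3 x 2 matrix, not taken from the text.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
V, R, U = svd_via_normal_matrix(A)
print(np.diag(R))                             # approximate singular values of A
print(np.linalg.svd(A, compute_uv=False))     # reference values for comparison
```

For a well-conditioned matrix such as this one, the diagonal of R should agree closely with the library singular values, and the off-diagonal entries of R should be at the level of rounding error.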


Example Consider again the matrix A of (9.7.28), based on the data of Table
9.3. The matrix $A^TA$ is given in (9.7.29). Using EISPACK and LINPACK
programs,


$$U = \begin{bmatrix} .7827 & .5963 & -.1764 & .0256 \\ .4533 & -.3596 & .7489 & -.3231 \\ .3326 & -.4998 & -.0989 & .7936 \\ .2670 & -.5150 & -.6311 & -.5150 \end{bmatrix}$$


The singular values are 


32.0102, 3.8935, .1674, .0026 


The matrix V is orthogonal and of order 21 X 21, and we omit it for obvious 
reasons. In practice it would not be computed, since it is a product of four 
Householder matrices, which can be stored in a simpler form. 


For a much more extensive discussion of the solution of least squares 
problems, see Golub and Van Loan (1983, chap. 6) and the book by Lawson and 
Hanson (1974). There are many additional practical problems that must be
discussed, including that of determining the rank of a matrix when rounding 
error causes it to falsely have full rank. For programs, see the appendix to 
Lawson and Hanson (1974) and LINPACK. For the SVD, see LINPACK or 
EISPACK. 


Discussion of the Literature 


The main source of information for this chapter was the well-known and 
encyclopedic book of Wilkinson (1965). Other sources were Golub and Van Loan 
(1983), Gourlay and Watson (1976), Householder (1964), Noble (1969, chaps. 
9-12), Parlett (1980), Stewart (1973), and Wilkinson (1963). For matrices of 
moderate size, the numerical solution of the eigenvalue problem is fairly well 
understood. For another perspective on the QR method, see Watkins (1982), and 
for an in-depth look at inverse iteration, see Peters and Wilkinson (1979). 
Excellent algorithms for most eigenvalue problems are given in Wilkinson and 
Reinsch (1971) and the EISPACK guides by Smith et al. (1976), and Garbow 
et al. (1977). For a history of the EISPACK project, see Dongarra and Moler 
(1984). An excellent general account of the problems of developing mathematical 
software for eigenvalue problems and other matrix problems is given in Rice 
(1981). The EISPACK package is the basis for most of the eigenvalue programs
in the IMSL and NAG libraries.


A number of problems and numerical methods have not been discussed in this 
chapter, often for reasons of space. For the symmetric eigenvalue problem, the 
Jacobi method has been omitted. It is an elegant and rapidly convergent method 
for computing all of the eigenvalues of a symmetric matrix, and it is relatively 
easy to program. For a description of the Jacobi method, see Golub and Van 
Loan (1983, sec. 8.4), Parlett (1980, chap. 9), and Wilkinson (1965, pp. 266-282).
An ALGOL program is given in Wilkinson and Reinsch (1971, pp. 202-211).
The generalized eigenvalue problem, Ax = λBx, has also been omitted. This has
become an important problem in recent years. The most popular method for its 
solution is due to Moler and Stewart (1973), and other descriptions of the 
problem and its solution are given in Golub and Van Loan (1983, secs. 7.7 and 
8.6) and Parlett (1980, chap. 15). EISPACK programs for the generalized 
eigenvalue problem are given in Garbow et al. (1977). 

The problem of finding the eigenvalues and eigenvectors of large sparse 
matrices is an active area of research. When the matrices have large order (e.g., 
n > 300), most of the methods of this chapter are more difficult to apply because 
of storage considerations. In addition, the methods often do not take special 
account of the sparseness of most large matrices that occur in practice. One 
common form of problem involves a symmetric banded matrix. Programs for this 
problem are given in Wilkinson and Reinsch (1971, pp. 266-283) and Garbow et 
al. (1977). For more general discussions of the eigenvalue problem for sparse 
matrices, see Jennings (1985) and Pissanetzky (1984, chap. 6). For a discussion of 
software for the eigenvalue problem for sparse matrices, see Duff (1984, pp. 
179-182) and Heath (1982). An important method for the solution of the 
eigenvalue problem for sparse symmetric matrices is the Lanczos method. For a 
discussion of it, see Scott (1981) and the very extensive books and programs of 
Cullum and Willoughby (1984, 1985). 

The least squares solution of overdetermined linear systems is a very im- 
portant tool, one that is very widely used in the physical, biological, and social 
sciences. We have just introduced some aspects of the subject, showing the 
crucial role of the singular value decomposition. A very comprehensive introduc- 
tion to the least squares solution of linear systems is given in Lawson and 
Hanson (1974). It gives a complete treatment of the theory, the practical 
implementation of methods, and ways for handling large data sets efficiently. In 
addition, the book contains a complete set of programs for solving a variety of 
least squares problems. For other references to the least squares solutions of
linear systems, see Golub and Van Loan (1983, chap. 6) and Rice (1981, chap.
11). Programs for some least squares problems are also given in LINPACK.

In discussing the least squares solution of overdetermined systems of linear 
equations, we have avoided any discussion of the statistical aspect of the subject. 
Partly this was for reasons of space, and partly it was a mistrust of using the 
statistical justification, since it often depends on assumptions about the distribu- 
tion of the error that are difficult to validate. We refer the reader to any of the 
many statistics textbooks for a development of the statistical framework for the 
least squares method for curve fitting of data.


Problems


1. Use the Gerschgorin theorem 9.1 to determine the approximate location of 
the eigenvalues of 


1 -1 0 -2 11 
(a) 1 3673 (b) ¥ 34 
a? ee ae io Se 


Where possible, use these results to infer whether the eigenvalues are real or 
complex. To check these results, compute the eigenvalues directly by 
finding the roots of the characteristic polynomial. 


2. (a) Given a polynomial
$p(\lambda) = \lambda^n + a_{n-1}\lambda^{n-1} + \cdots + a_0$


show $p(\lambda) = \det[\lambda I - A]$ for the matrix

The roots of p(λ) are the eigenvalues of A. The matrix A is called the
companion matrix for the polynomial p(λ).


(b) Apply the Gerschgorin theorem 9.1 to obtain the following bounds
for the roots r of p(λ): $|r| \le 1$ or $|r + a_{n-1}| \le |a_0| + \cdots + |a_{n-2}|$.
If these bounds give disjoint regions in the complex plane, what can
be said about the number of roots within each region?


(c) Use the Gerschgorin theorem on the columns of A to obtain additional
bounds for the roots of p(λ).


(d) Use the results of parts (b) and (c) to bound the roots of the following 
polynomial equations: 


(i) °+ 8+4+1=0 
(ii) AS — 4 +A —-¥B4+H-A41=0 


3. Recall the linear system (8.8.5) of Chapter 8, which arises when numerically
solving Poisson's equation. If the equations are ordered in the manner
described in (8.8.12) and following, then the linear system is symmetric
with positive diagonal elements. For the Gauss-Seidel iteration method in
(8.8.12) to converge, it is necessary and sufficient that A be positive definite,
according to Theorem 8.7. Use the Gerschgorin theorem 9.1 to prove A is
positive definite. It will also be necessary to quote Theorem 8.8, that λ = 0
is not an eigenvalue of A.


4. The values λ = -8.02861 and
x = [1.0, 2.50146, -.75773, -2.56421]


are an approximate eigenvalue and eigenvector for the matrix


2 1 3 4 
JlG Sea 2h). 35 
ASls G6 2 
a oe: ae | 


Use the result (9.1.22) to compute an error bound for λ.


5. For the matrix example (9.1.17) with ε = .001 and λ = 2, compute the
perturbation error bound (9.1.36). The same bound was given in (9.1.38) for
the other eigenvalue λ = 1.


6. Prove the eigenvector perturbation result (9.1.41). Hint: Assume $\lambda_k(\epsilon)$ and
$u_k(\epsilon)$ are continuously differentiable functions of ε. From (9.1.32),
$\lambda_k'(0) = v_k^* B u_k / s_k$. Write


$u_k(\epsilon) = u_k(0) + \epsilon u_k'(0) + O(\epsilon^2)$
and solve for $u_k'(0)$. Since $\{u_1, \ldots, u_n\}$ is a basis, write
$u_k'(0) = \sum_{j=1}^{n} a_j u_j$


To find $a_j$, first differentiate (9.1.40) with respect to ε, and then let ε = 0.
Substitute the previous formula for $u_k'(0)$. Use (9.1.29) and the
biorthogonality relation


from (9.1.28). 


7. For the following matrices A(ε), determine the eigenvalues and eigenvectors
for both ε = 0 and ε > 0. Observe the behavior as ε → 0.


1 1 0 

1 1 1 1 1 ie 
@®P? 3] wf} 2] off | @ 0 1 | 
0 ¢€ 1 
What do these examples say about the stability of eigenvector subspaces? 


8. Use the power method to calculate the dominant eigenvalue and associated
eigenvector for the following matrices. 


6 4 AOA a 2. a 
461 4 j =3° 4 5 
@) 14164 oO) 130057 «6 -2 
14 4 6 ae ce | 


Check the speed of convergence, calculating the ratios $R_m$ of (9.2.14). When
the ratios $R_m$ are fairly constant, use Aitken extrapolation to improve the
speed of convergence of both the eigenvalue and eigenvector, using the
eigenvalue ratios $R_m$ to accelerate the eigenvectors $\{z^{(m)}\}$.


9. Use the power method to find the dominant eigenvalue of


Use the initial guess $z^{(0)} = [1, 0, 1]^T$. Print each iterate $z^{(m)}$ and $\lambda_1^{(m)}$.
Comment on the results. What would happen if $\lambda_1^{(m)}$ were defined by


$\lambda_1^{(m)} = \alpha_m$?


10. For a matrix A of order n, assume its Jordan canonical form is diagonal
and denote the eigenvalues by $\lambda_1, \ldots, \lambda_n$. Assume that $\lambda_1 = \lambda_2 = \cdots = \lambda_r$
for some r > 1, and


$|\lambda_1| > |\lambda_{r+1}| \ge \cdots \ge |\lambda_n| \ge 0$


Show that the power method (9.2.2)-(9.2.3) will still converge to $\lambda_1$ and
some associated eigenvector, for most choices of initial vector $z^{(0)}$.


11. Let A be a symmetric matrix of order n, with the eigenvalues ordered by
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$
Define


$\mathcal{R}(x) = \frac{(Ax, x)}{(x, x)} \qquad x \ne 0, \quad x \in R^n$
using the standard inner product. Show


$\max \mathcal{R}(x) = \lambda_1 \qquad \min \mathcal{R}(x) = \lambda_n$


as x ≠ 0 ranges over $R^n$. The function $\mathcal{R}(x)$ is called the Rayleigh quotient,
and it can be used to characterize the remaining eigenvalues of A, in
addition to $\lambda_1$ and $\lambda_n$. Using these maximizations and minimizations for
$\mathcal{R}(x)$ forms the basis of some classical numerical methods for calculating
the eigenvalues of A.


12. To give a geometric meaning to the n × n Householder matrix
$P = I - 2ww^T$, let $u^{(2)}, \ldots, u^{(n)}$ be an orthonormal basis of the (n - 1)-dimensional
subspace that is perpendicular to w. Define

$T(x) = (I - 2ww^T)x \qquad x \in R^n$


Use the basis $\{w, u^{(2)}, \ldots, u^{(n)}\}$ for $R^n$ to write


$x = \alpha_1 w + \alpha_2 u^{(2)} + \cdots + \alpha_n u^{(n)}$
Apply T to this representation and interpret the results.


13. (a) Let A be a symmetric matrix, and let λ and x be an eigenvalue-eigenvector
pair for A with $\|x\|_2 = 1$. Let P be an orthogonal matrix
for which


$Px = e_1 = [1, 0, \ldots, 0]^T$


Consider the similar matrix $B = PAP^T$, and show that the first row
and column are zero except for the diagonal element, which equals λ.
Hint: Calculate and use $Be_1$.


(b) For the matrix


$$A = \begin{bmatrix} 2 & 10 & 2 \\ 10 & 5 & -8 \\ 2 & -8 & 11 \end{bmatrix}$$

λ = 9 is an eigenvalue with associated eigenvector $x = [\frac{2}{3}, \frac{1}{3}, \frac{2}{3}]^T$.
Produce a Householder matrix P for which $Px = e_1$, and then
produce $B = PAP^T$. The matrix eigenvalue problem for B can then be
reduced easily to a problem for a 2 × 2 matrix. Use this procedure to
calculate the remaining eigenvalues and eigenvectors of A. The process
of changing A to B and of then solving a matrix eigenvalue
problem of order one less than for A, is known as deflation. It can be
used to extend the applicability of the power method to other than the
dominant eigenvalue. For an extensive discussion, see Wilkinson
(1965, pp. 584-598) and Parlett (1980, chap. 5).


14. Use Householder matrices to produce the QR factorization of 


ie |  @ 2 
CE, als 0 as | QO) ft 2 3 
di 2 


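15. Consider the rotation matrix of order n,


$$R^{(k,l)} = \begin{bmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & \alpha & \cdots & \beta & & \\ & & \vdots & \ddots & \vdots & & \\ & & -\beta & \cdots & \alpha & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{bmatrix} \begin{matrix} \\ \\ \leftarrow \text{row } k \\ \\ \leftarrow \text{row } l \\ \\ \\ \end{matrix}$$


with $\alpha^2 + \beta^2 = 1$. If we compute $R^{(k,l)}b$ for a given $b \in R^n$, then the only
elements that will be changed are in positions k and l. By choosing α and β
suitably, we can force $R^{(k,l)}b$ to have a zero in position l. Choose α, β so that


$\begin{bmatrix} \alpha & \beta \\ -\beta & \alpha \end{bmatrix} \begin{bmatrix} b_k \\ b_l \end{bmatrix} = \begin{bmatrix} \gamma \\ 0 \end{bmatrix}$


for some γ.


(a) Derive formulas for α, β, and show $\gamma = \sqrt{b_k^2 + b_l^2}$.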




(b) Reduce $b = [1, 1, 1, 1]^T$ to a form $\hat{b} = [c, 0, 0, 0]^T$ by a sequence of
multiplications by rotation matrices:


$\hat{b} = R^{(1,2)}R^{(1,3)}R^{(1,4)}b$


16. Show how the rotation matrices $R^{(k,l)}$ can be used to produce the QR
factorization of a matrix.


17. (a) Do an operations count for producing the QR factorization of a
matrix using Householder matrices, as in Section 9.3. As usual,
combine multiplications and divisions, and keep a separate count for
the number of square roots.


(b) Repeat part (a), but use the rotation matrices $R^{(k,l)}$ for the reduction.


18. Give the explicit formulas for the calculation of the QR factorization of a
symmetric tridiagonal matrix. Do an operations count, and compare the
result with those of Problem 17.


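19. Use Theorem 9.5 to separate the roots of


$$\text{(a)} \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 \end{bmatrix} \qquad \text{(b)} \begin{bmatrix} 1 & 2 & 0 & 0 & 0 \\ 2 & 2 & 3 & 0 & 0 \\ 0 & 3 & 3 & 4 & 0 \\ 0 & 0 & 4 & 4 & 5 \\ 0 & 0 & 0 & 5 & 5 \end{bmatrix}$$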


Then obtain accurate approximations using the bisection method or some 
other rootfinding technique. 


20. (a) Write a program to reduce a symmetric matrix to tridiagonal form
using Householder matrices for the similarity transformations. For 
efficiency in the matrix multiplications, use the analogue of the form 
of multiplication shown in (9.3.18). 


(b) Use the program to reduce the following matrices to tridiagonal form: 


aay os eet 
@ {2 3 5 @) |; 142 
3 5 8 i 

112 4 


4 6 242 12 

6 225 3 18 
242 3 2 6 
12. 18 6 0 


(c) Calculate the eigenvalues of your reduced tridiagonal matrix as accu- 
rately as possible. 


21. Let $\{p_n(x) \mid n \ge 0\}$ denote a family of orthogonal polynomials with respect
to a weight function w(x) on an interval a < x < b. Further, assume that
the polynomials have leading coefficient 1:


$p_n(x) = x^n + \sum_{j=0}^{n-1} a_{n,j}x^j$


Find a symmetric tridiagonal matrix $R_n$ for which $p_n(\lambda)$ is the characteristic
polynomial. Thus, calculating the roots of an orthogonal polynomial
(and the nodes of a Gaussian quadrature formula) is reduced to the
solution of an eigenvalue problem for a symmetric tridiagonal matrix.
Hint: Recall the formula for the triple recursion relation for $\{p_n(x)\}$, and
compare it to the formula (9.4.3).


22. Use the QR method (a) without shift, and (b) with shift, to calculate the
eigenvalues of 


210 oo 
(a) |1 2 1 ® 19 1 214 
Vode? va 
001 2 
0100 0 
11100 
() 10 11 1 0 
00111 
te | ee eas a 


23. Let A be a Hessenberg matrix and consider the factorization A = QR, with
Q orthogonal and R upper triangular. 


(a) Recalling the discussion following (9.5.5), show that (9.5.7) is true. 


(b) Show that the result (9.5.7) implies a form for H, in (9.5.6) such that 
Q will be a Hessenberg matrix. 


(c) Show the product of a Hessenberg matrix and an upper triangular
matrix, in either order, is again a Hessenberg matrix.


When combined, these results show that RQ is again a Hessenberg matrix, 
as claimed in the paragraph following (9.5.7). 


24. For the matrix A of Problem 4, two additional approximate eigenvalues are
λ = 7.9329 and λ = 5.6689. Use inverse iteration to calculate the associated
eigenvectors.


25. Investigate the programs available at your computer center for the calculation
of the eigenvalues of a real symmetric matrix. Using such a program,
compute the eigenvalues of the Hilbert matrices $H_n$ for n = 3, 4, 5, 6, 7. To
check your answers, see the very accurate values given in Gregory and
Karney (1969, pp. 66-73).


26. Consider calculating the eigenvalues and associated eigenfunctions x(t) for
which


$\int_0^1 \frac{x(t)\,dt}{1 + (s - t)^2} = \lambda x(s) \qquad 0 \le s \le 1$


One way to obtain approximate eigenvalues is to discretize the equation
using numerical integration. Let h = 1/n for some n ≥ 1 and define
$t_j = (j - \frac{1}{2})h$, j = 1, ..., n. Substitute $t_i$ for s in the equation, and approximate
the integral using the midpoint numerical integration method.
This leads to the system


$h\sum_{j=1}^{n} \frac{\hat{x}(t_j)}{1 + (t_i - t_j)^2} = \lambda\hat{x}(t_i) \qquad i = 1, \ldots, n$


in which $\hat{x}(s)$ denotes a function that we expect approximates x(s). This
system is the eigenvalue problem for a symmetric matrix of order n. Find
the two largest eigenvalues of this matrix for n = 2, 4, 8, 16, 32. Examine the
convergence of these eigenvalues as n increases, and attempt to predict the
error in the most accurate case, n = 32, as compared with the unknown
true eigenvalues for the integral equation.


27. Show that the generalized inverse $A^+$ of (9.7.11) satisfies the following
Moore-Penrose conditions.


1. $AA^+A = A$          3. $(AA^+)^T = AA^+$
2. $A^+AA^+ = A^+$      4. $(A^+A)^T = A^+A$
Also show

5. $(A^+A)^2 = A^+A$    6. $(AA^+)^2 = AA^+$


Conditions (3)-(6) show that $A^+A$ and $AA^+$ represent orthogonal projections
on $R^n$ and $R^m$, respectively.


28. For an arbitrary m × n matrix A, show that
$\lim_{\alpha \to 0+} (\alpha I + A^TA)^{-1}A^T = A^+$

where α > 0. Hint: Use the SVD of A.


29. Unlike the situation with nonsingular square matrices, the generalized
inverse $A^+$ need not vary continuously with changes in A. To support this,
find a family of matrices {A(ε)} where A(ε) converges to A(0), but $A(\epsilon)^+$
does not converge to $A(0)^+$.


30. Calculate the linear polynomial least squares fit for the following data.
Graph the data and the least squares fit. Also, find the root-mean-square
error in the least squares fit.

  t_i      b_i       t_i      b_i       t_i      b_i

 -1.0    1.032      -.3    1.139        .4    -.415
  -.9    1.563      -.2     .646        .5    -.112
  -.8    1.614      -.1     .474        .6    -.817
  -.7    1.377      0.0     .418        .7    -.234
  -.6    1.179       .1     .067        .8    -.623
  -.5    1.189       .2     .371        .9    -.536
  -.4     .910       .3     .183       1.0   -1.173

31. Do a quadratic least squares fit to the following data. Use the standard
form
$g(t) = x_1 + x_2t + x_3t^2$


and use the normal equation (9.7.26). What is the condition number of
$A^TA$?


  t_i      b_i       t_i      b_i       t_i      b_i

 -1.0    7.904      -.3     .335        .4    -.711
  -.9    7.452      -.2    -.271        .5     .224
  -.8    5.827      -.1    -.963        .6     .689
  -.7    4.400      0.0    -.847        .7     .861
  -.6    2.908       .1   -1.278        .8    1.358
  -.5    2.144       .2   -1.335        .9    2.613
  -.4     .581       .3    -.656       1.0    4.599

32. For the matrix A arising in the least squares curve fitting of Problem 31,
calculate its QR factorization, its SVD, and its generalized inverse. Use 
these to again solve the least squares problem. 


33. Find the QR factorization, singular value decomposition, and generalized
inverse of the following matrices. Also give $\mathrm{cond}(A)_2$.


oo. a 
(a) A={-10 -1.0 (b) A= 

11 9 3 4 5 

: : 4 5 6 


34. (a) Let A be m × n, m ≥ n, and suppose that the columns of A are
nearly dependent. More precisely, let $A = [u_1, \ldots, u_n]$, $u_j \in R^m$, and
suppose the vector

$v = \alpha_1u_1 + \cdots + \alpha_nu_n$
is quite small compared to $\|\alpha\|_2$, $\alpha = [\alpha_1, \ldots, \alpha_n]^T$. Show that A will
have a large condition number.


(b) In contrast to part (a), suppose the columns of A are orthonormal.
Show $\mathrm{cond}(A)_2 = 1$.


INDEX 


i Note: (1) An asterisk (*) following a subentry name means that name is also listed separately with 
additional subentries of its own. (2) A page number followed by a number in parentheses, prefixed by 
! P, refers to a problem on the given page. For example, 123(P30) refers to problem 30 on page 123. 


Absolute stability, 406 
Acceleration methods: 

eigenvalues, 606 

linear systems, 83 

numerical integration, 255, 294 

rootfinding, 83 
Adams~Bashforth methods, 385 
Adams methods, 385 

DE/STEP, 390 

stability*, 404 

stability region*, 407 

variable order, 390 
Adams—Moulton methods, 387 
Adaptive integration, 300 

CADRE, 302 

QUADPACK, 303 

Simpson’s rule, 300 
Adjoint, 465 
Aitken, 86 
Aitken extrapolation: 

eigenvalues, 607 

linear iteration, 83 

numerical integration, 292 

rate of convergence, 123(P30) 
Algorithms: 

Aitken, 86 

Approx, 235, 

Bisect, 56 

Chebeval, 221 

Cq, 567 

Detrap, 376 

Divdif, 141 

Factor, 520 

Interp, 141 

Newton, 65 

Polynew, 97 

Romberg, 298 

Solve, 521 
Alternating series, 239(P2) 


Angle between vectors, 469 

Approx, 235 

Approximation of functions, 197 
Chebyshev series 219, 225 
de la Vallee—Poussin theorem, 222 
economization of power series, 245(P39) 
equioscillation theorem, 224 
even/odd functions, 229 
interpolation, 158 
Jackson’s theorem, 180, 224 
least squares*, 204, 206, 216 
minimax*, 201, 222 
near-minimax*, 225 
Taylor’s theorem, 4, 199 

AS(-), 513 

A-stability, 371, 408, 412 

Asymptotic error formula: 
definition, 254 
differential equations, 352, 363, 370 
Euler~MacLaurin formula’, 285, 290 
Euler’s method, 352 
numerical integration, 254, 284 
Runge-Kutta formulas, 427 
Simpson’s rule, 258 
trapezoidal rule, 254 

Augmented matrix, 510 

Automatic numerical integration, 299 
adaptive integration, 300 
CADRE, 302 
QUADPACK, 303 
Simpson’s rule, 300 


Back substitution, 508 

Backward differences, 151 

Backward differentiation formulas, 410 
Backward error analysis, 536 

Backward Euler method, 409 

Banded matrix, 527 

Basic linear algebra subroutines, 522, 570 
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Basis, 464 . . Christoffel-Darboux identity, 216 
: orthogonal, 469 Collocation methods, 444 
standard, 465 Column norm, 487 
Bauer—Fike theorem, 592 Compact methods, 523 
2 | Bernoulli numbers, 284 Compansion matrix, 649(P2) 
| Bernoulli polynomials, 284, 326(P23) Complete pivoting, 515 
Bernstein polynomials, 198 Complex linear systems, 575(P5) 
: Bessel’s inequality, 218 Composite Simpson’s rule, 257 
Best approximation, see Minimax approximation Composite trapezoidal mle, 253 
Binary number, 11 Cond( A), 530 
Binomial coefficients, 149 Cond(A),, Cond(A),, 531 
Biorthogonal family, 597 Condition number, 35, 58 
| Bisect, 56 calculation, 538 
| Bisection method, 56-58 eigenvalues, 594, 599 
i convergence, 58 Gastinel’s theorem, 533 
BLAS, 522, 570 Hilbert matrix, 534 
i Boole’s rule, 266 matrices, 530 
Boundary value problems, 433 Conjugate directions methods, 564 
collocation methods, 444 Conjugate gradient method, 113, 562, 566 
H existence theory, 435, 436 acceleration, 569 
i finite difference methods, 441 convergence theorem, 566, 567 
SEER ntact eesthian se eet et integral equation methods, 444 optimality, 566 
aes Sey ee ee a shooting methods, 437 projection framework, 583(P39) 
i Brent’s method, 91 Conjugate transpose, 465 
comparison with bisection method, 93 Consistency condition, 358, 395 
| convergence criteria, 91 Runge-Kutta methods, 425 
! Convergence: 
C”, 463 interval, 56 
i C,, 219 - linear, 56 
Cla, b}, 199 order, 56 
CADRE, 302 : quadratic, 56 
i Canonical forms, 474 rate of linear, 56 
| Jordan, 480 vector, 483 
1 Schur, 474 Conversion between number bases, 45(P10, P11) 
singular value decomposition, 478 Corrected trapezoidal rule, 255, 324(P4) 
i symmetric matrices, 476 Corrector formula, 370 
Cauchy—Schwartz inequality, 208, 468 Cq, 567 
Cayley—Hamilton theorem, 501(P20) Cramer’s rule, 514 
Change of basis matrix, 473 Crout’s method, 523 
| Characteristic equation: 
i differential equations, 364, 397 Data error, 20, 29, 325(P13) 
i matrices, 471 : Deflation, polynomial, 97 
. Characteristic polynomial, 397, 471 matrix, 609, 651(P13) 
Characteristic roots, 398 Degree of precision, 266 
i Chebeval, 221 de la Vallee—Poussin theorem, 222 
Chebyshev equioscillation theorem, 224 Dense family, 267 
: Chebyshev norm, 200 Dense linear systems, 507 
| Chebyshev polynomial expansion, 219, 225 Dense matrix, 507 
i Chebyshev polynomials, 211 DE/STEP, 390 
: . maxima, 226 Detecting noise in data, 153 
| minimax property, 229 Determinant, 467, 472 
second kind, 243(P24) calculation, 512 
triple recursion formula, 211 Detrap, 376 
zeros, 228 Diagonally dominant, 546 
Chebyshev zeros, interpolation at, 228 Difference equations, linear, 363, 397 
Cholesky method, 524, 639 Differential equations: 
Chopping, 13 automatic programs, 
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Adam’s methods, 390 Direction field, 334 
: boundary value codes, 444, 446 Dirichlet problem, see Poisson's equation 
i comparison, 446 Divdif, 141 
i control of local error, 373, 391 Divided difference interpolation formula, 140 
DE/STEP, 39% Divided differences, 9, 139 
error control, 391 continuity, 146 
error per unit stepsize, 373 diagram for calculation, 140 
: global error, 392 differentiation, 146 
: RKF45, 431 formulas, 139, 144 
: variable order, 390 Hermite—Gennochi formula, 144 
i boundary value problems, 433 t interpolation, 140 
direction field, 334 polynomials, 147 
existence theory, 336 relation to derivatives, 144 
first order linear, 333 recursion relation, 139 
higher order, 340 Doolittle’s method, 523 
! ill-conditioned, 339 
initial value problem, 333 Economization of Taylor series, 245(P39) 
integral equation equivalence, 451(P5) Eigenvalues, 471 
linear, 333, 340 Bauer—Fike theorem, 592 
model equation, 396 condition number, 594, 600 
numerical solution: deflation, 609 
Adam’s methods*, 385 error bound for symmetric matrices, 595 
A-stability, 371, 408, 412 ; Gerschgorin theorem, 588 
backward Euler method, 409 ill-conditioning, 599 
boundary value problems*, 433 location, 588 
characteristic equation, 397 matrices with nondiagonal Jordan form, 601 
convergence theory, 360, 401 multiplicity, 473 
: corrector formula, 370 numerical approximation 
i Euler’s method*, 341 EISPACK, 588, 645, 662 
explicit methods, 357 Jacobi method, 645 
extrapolation methods, 445 power method*, 602 
global error estimation, 372, 392 QR method*, 623 
grid size, 341 sparse matrices, 646 
: implicit methods, 357 Sturm sequences, 620 
: lines, ‘method of, 414 . numerical solution, see Eigenvalues, numerical 
; local solution, 368 approximation 
midpoint method*, 361 stability*, 591 
i model equation, 363, 370, 397 under unitary transformations, 600 
1 : multistep methods*, 357 symmetric matrices*, 476 
numerical integration, 384 tridiagonal matrices*, 619 
node points, 341 Eigenvector(s), 471 
predictor formula, 370 numerical approximation: 
Runge-Kutta methods*, 420 ETSPACK, 588, 645, 662 
single-step methods, 418 error bound, 631 
stability, 349, 361, 396 inverse iteration*, 628 
stability regions, 404 power method*, 602 
stiff problems*, 409 " stability, 600 
Taylor series method, 418 EISPACK, 588, 645, 662 
trapezoidal method", 366 Enclosure rootfinding methods, 58 
undetermined coefficients, 381 - Brent’s method, 91 
variable order methods, 390, 445 Error: 
3 Picard iteration, 451(P5) backward analysis, 536 
ee : stability, 337 chopping, 13 
; stiff, 339, 409 data, 20, 29, 153 
systems, 339 definitions, 17 
Differentiation, see Numerical differentiation loss of significance, 24, 28 
Dimension, 464 machine,.20 
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Error (Continued) 
noise, 21 
propagated, 23, 28 
rounding, 13 
sources of, 18 
statistical treatment, 31 
summation, 29 
truncation, 20 
unit roundoff, 15 
Error estimation: 
global, 392, 433 
rootfinding, 57, 64, 70, 84, 129(P16) 
Error per stepsize, 429 
Error per unit stepsize, 373 
Euclidean norm, 208, 468 
Euler-MacLaurin formula, 285 
generalization, 290 
summation formula, 289 
Euler’s method, 341 
asymptotic error estimate, 352, 356 
backward, 409 
convergence analysis, 346 
derivation, 342 
error bound, 346, 348
rounding error analysis, 349 
stability*, 349, 405 
systems, 355 
truncation error, 342 
Even/odd functions, 229 
Explicit multistep methods, 357 
Exponent, 12 
Extrapolation methods: 
differential equations, 445 
numerical integration, 294 
rootfinding, 85 


F_n(x), 232
Factor, 520 
Fast Fourier transform, 181 
Fehlberg, Runge-Kutta methods, 429 
Fibonacci sequence, 68 
Finite differences, 147 
interpolation formulas, 149, 151 
Finite dimension, 464 
Finite Fourier transform, 179 
Fixed point, 77 
Fixed point iteration, see One-point iteration methods
fl(x), 13
Floating-point arithmetic, 11-17, 39 
Floating-point representation, 12 
accuracy, 15
chopping, 13 
conversion, 12 
exponent, 12 
mantissa, 12 
overflow, 16 


radix point, 12 
rounding, 13 
underflow, 16 
unit round, 15 
Forward differences, 148 
detection of noise by, 153 
interpolation formula, 149 
linearity, 152 
relation to derivatives, 151 
relation to divided differences, 148 
tabular form, 149 
Fourier series, 179, 219 
Frobenius norm, 484 


Gastinel’s theorem, 533 

Gaussian elimination, 508 
backward error analysis, 536 
Cholesky method, 524 
compact methods, 523 
complex systems, 575(P5) 
error analysis, 529 
error bounds, 535, 539 
error estimates, 540 
Gauss—Jordan method, 522 
iterative improvement, 541 
LU factorization*, 511 
operation count, 512 
pivoting, 515 


positive definite matrices, 524, 576(P12) 


residual correction*, 540 
scaling, 518 
tridiagonal matrices, 527 
variants, 522 
Wilkinson theorem, 536 
Gaussian quadrature, 270. See also 
Gauss—Legendre quadrature 
convergence, 277 
degree of precision, 272 
error formulas, 272, 275 
formulas, 272, 275 
Laguerre, 308 
positivity of weights, 275 
singular integrals, 308 
weights, 272 
Gauss—Jacobi iteration, 545 
Gauss—Jordan method, 522 
matrix inversion, 523 
Gauss—Legendre quadrature, 276 
comparison to trapezoidal rule, 280 
computational remarks, 281 
convergence discussion, 279 
error formula, 276 
Peano kernel, 279 
weights and nodes, 276 
Gauss-Seidel method, 548 
acceleration, 555 
convergence, 548, 551 


rate of convergence, 548 
Generalized inverse, 636, 655(P27, P28) 
Geometric series, 5, 6 

matrix form, 491 
Gerschgorin theorem, 588 
Global error, 344 

estimation, 392, 433 
Gram-Schmidt method, 209, 242(P19)
Grid size, 341 


Heat equation, 414 
Hermite-Birkhoff interpolation, 190(P28, P29), 
191(P30) 

Hermite—Gennochi formula, 144 

Hermite interpolation, 159 
error formula, 161, 190(P27) 
formulas, 160, 161, 189(P26) 
Gaussian quadrature, 272 
general interpolation problem, 163 
piecewise cubic, 166 

Hermitian matrix, 467. See also Symmetric 

matrix 

Hessenberg matrix, 624 

Hexadecimal, 11 

Higher order differential equations, 340 

Hilbert matrix, 37, 207, 533 
condition number, 534 
eigenvalues, 593 

Horner’s method, 97 

Householder matrices, 609, 651(P12) 
QR factorization*, 612 
reduction of symmetric matrices, 615 
transformation of vector, 611 


I_n(x), 228
Ill-conditioned problems, 36
differential equations, 339 
eigenvalues, 599 
inverse problems, 40 
linear systems, 532 
polynomials, 99 
Ill-posed problem, 34, 329(P41) 
Implicit multistep methods, 357 
iterative solution, 367, 381 
Infinite dimension, 464 
Infinite integrand, 305 
Infinite interval of integration, 305 
Infinite product expansion, 117(P1) 
Infinity norm, 10, 200 
Influence function, 383 
Initial value problem, 333 
Inner product, 32, 208, 468 
error, 32 
Instability, 39. See also Ill-conditioned problems
Integral equation, 35, 444, 451(P5), 570, 575(P4), 
655(P26) 
Integral mean value theorem, 4 
Integration, see Numerical integration
Intermediate value theorem, 3 
Interp, 141 
Interpolation: 
exponential, 187(P11) 
multivariable, 184
piecewise polynomial*, 162, 183 
polynomial, 131 
approximation theory, 158 
backward difference formula, 151 
barycentric formula, 186(P3) 
at Chebyshev zeros, 228 
definition, 131 
divided difference formula, 140 
error behavior, 157
error in derivatives, 316 
error formula, 134, 143, 155-157 
example of log10 x, 136
existence theory, 132 
forward difference formula, 149 
Hermite, 159, 163 
Hermite—Birkhoff, 190(P29) 
inverse, 142 
Lagrange formula, 134 
non-convergence, 159 
numerical integration*, 263 
rounding errors, 137 
Runge’s example, 158 
rational, 187(P12) 
spline function*, 166 
trigonometric*, 176 
Inverse interpolation, 142 
Inverse iteration, 628 
computational remarks, 630 
rate of convergence, 630 
Inverse matrix, 466 
calculation, 514, 523 
error bounds, 538 
iterative evaluation, 581(P32) 
operation count, 514 
Iteration methods, see also Rootfinding 
differential equations, 367, 381 
eigenvalues, 602 
eigenvectors, 602, 628 
linear systems: 


comparison to Gaussian elimination, 554 


conjugate gradient method*, 562, 566 
error prediction, 543, 553 
Gauss—Jacobi method, 545 
Gauss—Seidel method, 548 
general schema, 549 
multigrid methods, 552 
Poisson’s equation, 557 
rate of convergence, 542, 546, 548 
SOR method, 555, 561 
nonlinear systems of equations*, 103 
one-point iteration*, 76 

polynomial rootfinding*, 94-102 
Iterative improvement, 541 
Interval analysis, 40 


Jackson’s theorem, 180, 224 
Jacobian matrix, 105, 356 
Jacobi method, 645 

Jordan block, 480 

Jordan canonical form, 480 


Kronecker delta function, 466 
Kronrod formulas, 283 


Lagrange interpolation formula, 134 
Laguerre polynomials, 211, 215 
Least squares approximation: 
continuous problem, 204 
convergence, 217 
formula, 217 
weighted, 216 
discrete problem, 633 
data fitting problem, 637 
definition, 634
generalized inverse, 636, 655(P27, P28) 
QR solution procedure, 642 
singular value solution, 635 
stability, 637 
Least squares data fitting, see Least squares 
approximation, discrete problem 
Legendre polynomial expansion, 218 
Legendre polynomials, 210, 215 
Level curves, 335 
Linear algebra, 463 
Linear combination, 464 
Linear convergence, 56
acceleration, 83 
rate, 56 
Linear dependence, 464 
Linear difference equations, 363, 397 
Linear differential equation, 333, 340 
Linear independence, 464 
Linear iteration, 76, 103. See also One-point 
iteration methods 
Linearly convergent methods, 56 
Linear systems of equations, 466, 507 
augmented matrix, 510 
BLAS, 522, 570 
Cholesky method, 524 
compact methods, 523 
condition number, 530, 531 
conjugate gradient method *, 562, 566 
Crout’s method, 523 
dense, 507 
Doolittle’s method, 523 
error analysis, 529 
error bounds, 535 


Gaussian elimination *, 508 
variants, 522 

Gauss-Jacobi method, 545 

Gauss-Seidel method *, 548 

iterative solution *, 540, 544 

LINPACK, 522, 570, 663 

LU factorization *, 511

numerical solution, 507 

over-determined systems, see Least squares 
approximation, discrete problem, data fitting 
problem 

Poisson’s equation *, 557 

residual correction method *, 540 

scaling, 518 

solution by QR factorization, 615, 642 

solvability of, 467 

SOR method, 555, 561 

sparse, 507, 570 

tridiagonal, 527 

Lines, method of, 414. See also Method of lines
LINPACK, 522, 570, 663


Lipschitz condition, 336, 355, 426


Local error, 368 
Local solution, 368 
Loss of significance error, 24, 28 
LU factorization, 511 
inverse iteration, 628 
storage, 511 
tridiagonal matrices, 527 
uniqueness, 523 


Machine errors, 20 
Mantissa, 12 
Mathematical modelling, 18 
Mathematical software, 41, 661 
Matrix, 465 
banded, 527 
canonical forms *, 474 
characteristic equation, 471 
condition number, 530, 531 
deflation, 609, 651(P13) 
diagonally dominant, 546 
geometric series theorem, 491 
Hermitian, 467 
Hilbert, 37, 207, 533 
Householder *, 609 
identity, 466 
inverse *, 466, 514 
invertibility of, 467 
Jordan canonical form, 480 
LU factorization *, 511 
nilpotent, 480 
norm, 481. See also Matrix norm; Vector norm 
notation, 618 
operations on, 465 
order, 465 
orthogonal, 469. See also Orthogonal 
transformations 


permutation, 517 
perturbation theorems, 493 
positive definite, 499(P14), 524 
principal axes theorem, 476 
projection, 498(P10), 583(P39) 
rank, 467 
Schur normal form, 474 
similar, 473 
singular value decomposition, 478
symmetric *, 467 
tridiagonal, 527 
unitary, 469, 499(P13) 
zero, 466 
Matrix norm, 484 
column, 487 
compatible, 484 
Frobenius, 484 
operator, 485 
relation to spectral radius, 489 
row, 488 
Maximum norm, 200, 481 
MD(e), 513 
Mean value theorem, 4 
Method of lines, 414 
convergence, 415 
explicit method, 416 
heat equation, 414 
implicit methods, 417 
Midpoint method, differential equations, 361 
asymptotic error formula, 363 
characteristic equation, 364 
error bound, 361 
weak stability, 365 
Midpoint numerical integration, 269, 325(P10) 
Milne’s method, 385 
Minimax approximation, 201 
equioscillation theorem, 224 
error, 201 
speed of convergence, 224 
MINPACK, 114, 570, 663 
Moore—Penrose conditions, 655(P27) 
Muller’s method, 73 
Multiple roots, 87 
instability, 98, 101 
interval of uncertainty, 88 
Newton’s method, 88 
noise, effect of, 88 
removal of multiplicity, 90
Multipliers, 509 
Multistep methods, 357 
Adams-Bashforth, 385
Adams methods *, 385
Adams-Moulton, 387
convergence, 360, 401 
consistency condition, 358, 395 
derivation, 358, 381 
error bound, 360 
explicit, 357 
general form, 357

general theory, 358, 394 

implicit, 357, 381 

iterative solution, 367, 381 

midpoint method *, 361 

Milne’s method, 385 

model equation, 363, 397 

numerical integration, 384 

order of convergence, 359 

parasitic solution, 365, 402 

Peano kernel, 382 

relative stability, 404 

root condition, 395 

stability, 361, 396 

stability regions, 404 

stiff differential equations, 409 

strong root condition, 404 

trapezoidal method *, 366 

truncation error, 357

undetermined coefficients, method of, 381 

unstable examples, 396 

variable order, 390, 445 
Multivariable interpolation, 184 
Multivariable quadrature, 320 


Near-minimax approximation, 225 
Chebyshev polynomial expansion, 219, 225 
forced oscillation, 232 
interpolation, 228 

Nelder—Mead method, 114 

Nested multiplication, 96 

Newton, 65 

Newton backward difference, 151. See also 

Backward differences 

Newton—Cotes integration, 263 
closed, 269 
convergence, 266 
error formula, 264 
open, 269 

Newton divided differences, 139. See also 

Divided differences 
Newton forward differences, 148. See also 
Forward differences 

Newton—Fourier method, 62 

Newton’s method, 58 
boundary value problems, 439, 442 
comparison with secant method, 71 
convergence, 60 
error estimation, 64 
error formula, 60 
multiple roots, 88 
Newton-Fourier method, 62 
nonlinear systems, 109, 442 
polynomials, 97 
reciprocal calculations, 54 
square roots, 119(P12, P13) 

Nilpotent matrix, 480 

Node points, 250, 341 




Noise in data, detection, 153 
Noise in function evaluation, 21, 88 
Nonlinear equations, See Rootfinding 
Nonlinear systems, 103 
convergence, 105, 107 
fixed points iteration, 103 
MINPACK, 114, 663 
Newton’s method, 109 
Normal equation, 638 
Norms, 480 
compatibility, 484 
continuity, 482 
equivalence, 483 
Euclidean, 208, 468 
Frobenius, 484 
matrix *, 484 
maximum, 200, 481 
operator, 485 
spectral radius, 485 
two, 208 
uniform, 200 
vector *, 481 
Numerical differentiation, 315 
error formula, 316 
ill-posedness, 329(P41) 
interpolation based, 315 
noise, 318 
undetermined coefficients, 317 
Numerical integration, 249 
adaptive *, 300
Aitken extrapolation, 292 
automatic programs, 302, 663 
Boole’s rule, 266 
CADRE, 302 
convergence, 267 
comparison of programs, 303 
corrected trapezoidal rule, 255 
degree of precision, 266 


Euler-MacLaurin formula *, 285, 290


Gaussian quadrature *, 270 
Gauss—Legendre quadrature *, 276 
general schemata, 249 

Kronrod formulas, 283 
midpoint rule *, 269 

multiple integration, 320 
Newton-Cotes formulas *, 263
noise, effect of, 325(P13) 

open formulas, 269 

Patterson’s method, 283
rectangular rule, 343 
Richardson extrapolation, 294 
Romberg’s method, 298 
Simpson’s method *, 256
singular integrals *, 305 
standard form, 250
three-eighths rule, 264
trapezoidal rule *, 252 


O(h?), O(1/n*), 291, 352 
One-point iteration methods, 76 
convergence theory, 77—83 
differential equations, 367, 413 
higher order convergence, 82 
linear convergence, 56 
Newton's method *, 58 
nonlinear systems *, 103 
Operations count, 576(P6) 
Gaussian elimination, 512 
Operator norm, 485 
Optimization, 111, 115, 499(P17)
conjugate directions method, 564 
conjugate gradient method, 113 
descent methods, 113 
method of steepest descent, 113 
MINPACK, 114, 663 
Nelder—Mead method, 114 
Newton’s method, 112 
quasi-Newton methods, 113 
Order of convergence, 56 
Orthogonal, 209. See also Orthogonal 
transformations 
basis, 469 
family, 212 
matrix, 469 
Orthogonal polynomials, 207 
Chebyshev polynomials *, 211
Christoffel-Darboux identity, 216 
Gram—Schmidt method, 209 
Laguerre, 211, 215 
Legendre, 210, 215 
triple recursion relation, 214 
zeros, 213, 243(P21) 


Orthogonal projection, 470, 498(P10), 583(P39) 
Orthogonal transformations:


Householder matrices *, 609 
planar rotations, 618, 652(P15) 
preservation of length, 499(P13) 
QR factorization *, 612
symmetric matrix, reduction of, 615 
transformation of vector, 611 

Orthonormal family, 212

Overflow error, 17, 22


Pade approximation, 237, 240(P5) 

Parallel computers, 571 

Parasitic solution, 365, 402 

Parseval’s equality, 218

Partial differential equations, 414, 557

Partial pivoting, 515 

Patterson’s method, 283 

Peano kernel, 258, 279, 325(P10) 
differential equations, 382 

Permutation matrix, 517 

Picard iteration, 451(P5) 
Piecewise polynomial interpolation, 163, 183
evaluation, 165
Hermite form, 166
Lagrange form, 164
spline functions *, 166
Pivot element, 515
Pivoting, 515
Planar rotation matrices, 618, 652(P15)
p-Norm, 481
Poisson’s equation, 557
finite difference approximation, 557
Gauss-Seidel method, 553, 560
generalization, 582(P34)
SOR method, 556, 561
Polynew, 97
Polynomial interpolation, 131. See also
Interpolation, polynomial
Polynomial perturbation theory, 98, 99
Polynomial rootfinding, 94-102
bounds on roots, 95, 127(P36, P37)
companion matrix, 649(P2)
deflation, 97, 101
ill-conditioning, 99
Newton’s method, 97
stability, 98
Positive definite matrix, 499(P14), 524
Cholesky’s method, 524
Power method, 602
acceleration techniques, 606
Aitken extrapolation, 607
convergence, 604
deflation, 609, 651(P13)
Rayleigh-Ritz quotient, 605
Precision, degree, 266
Predictor-corrector method, 370, 373. See also
Multistep methods
Predictor formula, 370
Principal axes theorem, 476
Principal root, 398
Product integration, 310
Projection matrix, 498(P10), 583(P39)
Propagated error, 23, 28


q*(x), 201
QR factorization, 612
practical calculation, 613
solution of linear systems, 615, 642
uniqueness, 614
QR method, 623
convergence, 625
preliminary preparation, 624
rate of convergence, 626
with shift, 626
QUADPACK, 303, 663
Quadratic convergence, 56
Quadratic form, 497(P7)
Quadrature, see Numerical integration


Quasi—Newton methods, 110, 113 


r* (x), 204, 207 

R^n, 463

Radix, 12 

Rank, 467 

Rational interpolation, 187(P12) 

Rayleigh quotient, 608, 651(P11) 

Relative error, 17 

Relative stability, 404 

Region of absolute stability, 404 

Residual correction method, 540 
convergence, 542 
error bounds, 543 

Reverse triangle inequality, 201 

Richardson error estimate, 296 

Richardson extrapolation, 294, 372 

RK, 420 

RKF, 429 

RKF45, 431 

Romberg, 298 

Romberg integration, 296 
CADRE, 302 

Root condition, 395 

Rootfinding, 53 
acceleration, 85 
Aitken extrapolation *, 83 
bisection method *, 56 
Brent’s method *, 91 
enclosure methods, 58 
error estimates, 57, 64, 70 
Muller’s method *, 73 
multiple roots *, 87 
Newton’s method *, 58, 108 
nonlinear systems *, 103 
one-point iteration methods *, 76 
optimization, 111 
polynomials *, 94-102 
secant method *, 66 
Steffenson’s method, 122(P28) 
stopping criteria, 64 

Root mean square error, 204, 638 

Rotations, planar, 618, 652(P15) 

Rounding error, 13 
differential equations, 349 
Gaussian elimination, 536 
interpolation, 137, 187(P8) 
numerical differentiation, 318 
numerical integration, 325(P13) 

Row norm, 488 

Runge-Kutta methods, 420 
asymptotic error, 427 
automatic programs, 431 
classical fourth order formula, 423 
consistency, 425 
convergence, 426 
derivation, 421 
error estimation, 427 
Fehlberg methods, 429
global error, 433 
implicit methods, 433 
low order formulas, 420, 423 
Richardson error estimate, 425 
RKF45, 431
stability, 427 
truncation error, 420 


Scalars, 463 
Scaling, 518 
Schur normal form, 474 
Secant method, 66 
error formula, 67, 69 
comparison with Newton’s methods, 71 
convergence, 69 
Shooting methods, 437 
Significant digits, 17 
Similar matrices, 473 
Simpson’s rule, 256 
adaptive, 300
Aitken extrapolation, 292 
asymptotic error formula, 258 
composite, 257 
differential equations, 384 
error formulas, 257, 258 
Peano kernel, 260 
product rule, 312 
Richardson extrapolation, 295
Simultaneous displacements, 545. See also
Gauss-Seidel method
Single step methods, 418. See also Runge-Kutta 
methods 
Singular integrands, 305 
analytic evaluation, 310 
change of variable, 305 
Gaussian quadrature *, 308
IMT method, 307 
product integration, 310 
Singular value decomposition, 478, 500(P19), 634 
computation of, 643
Singular values, 478 
Skew-symmetric matrix, 467 
Solve, 521 
SOR method, 555, 561
Sparse linear systems, 507, 570 
eigenvalue problem, 646 
Poisson’s equation, 557 
Special functions, 237 
Spectral radius, 485 
Spline function, 166 
B-splines, 173 
complete spline interpolant, 169 
construction, 167 


error, 169 

natural spline interpolation, 192(P38) 
not-a-knot condition, 171 
optimality, 170, 192(P38) 

Square root, calculation, 119(P12, P13) 

Stability, 34 
absolute, 406 
differential equations, 337 

numerical methods, 349, 361, 396 
eigenvalues, 592, 599 
Euler’s method, 349 
numerical methods, 38 
polynomial rootfinding, 99 
relative stability, 404 
weak, 365 

Stability regions, 404 

Standard basis, 465 

Steffenson’s method, 122(P28) 

Stiff differential equations, 409 
A-stable methods, 371, 408, 412 
backward differentiation formulas, 410 
backward Euler method, 409 
iteration methods, 413 
method of lines, 414 
trapezoidal method, 412 

Stirling’s formula, 279 

Strong root condition, 404 

Sturm sequence, 620 

Successive displacements, 548. See also 

Gauss—Seidel method 

Successive over-relaxation, See SOR method 

Summation errors, 29-34 
chopping vs. rounding, 30 
inner products, 32 
loss of significance errors, 27 
statistical analysis, 31 

Supercomputer, 40, 571 

Symbolic mathematics, 41 

Symmetric matrix, 467 
deflation, 651(P13) 
eigenvalue computation, 619, 623 
eigenvalue error bound, 595 
eigenvalues, 476 
eigenvalue stability, 593, 595 
eigenvector computation, 631 
Jacobi’s method, 645 
positive definite, 499(P14), 576(P12) 
QR method *, 623
Rayleigh-Ritz quotient, 608, 651(P11) 
similarity to diagonal matrix, 476 
Sturm sequence, 620 
tridiagonal matrix, reduction to, 615 
Wielandt—Hoffman theorem, 595 

Systems of differential equations, 339, 355, 397, 437

Systems of linear equations, see Linear systems of equations


Taylor series method, differential equations, 420 
Taylor’s theorem, 4, 199 
geometric series theorem, 5, 6 
important expansions, 5 
two-dimensional form, 7 
Telescoping of Taylor series, 245(P39) 
Three-eighths quadrature rule, 264
Trace, 472 
Transpose, 465 
Trapezoidal methods:
differential equations, 366 
A-stability, 371, 456(P37) 
asymptotic error, 370 
convergence, 370 
global error, 379 
iterative solution, 367 
local error, 368
Richardson extrapolation, 372 
stability, 370 
stability region, 409 
numerical integration, 252 
asymptotic error formula, 254 
comparison to Gaussian quadrature, 280 
composite, 253 
corrected trapezoidal rule, 255 
error formula, 253 
Euler—MacLaurin formula *, 285 
Peano kernel, 259, 324(P5) 
periodic integrands, 288
product rule, 311 
Richardson extrapolation, 294 
Triangle inequality, 10, 200, 468 
Triangular decomposition, see LU factorization 
Tridiagonal linear systems, solution, 527 
Tridiagonal matrix eigenvalue problem, 619 
error analysis, 622 
Givens’ method, 619
Householder’s method, 619 
Sturm sequences, 620 
Trigonometric functions, discrete orthogonality, 
178, 193(P42), 233 
Trigonometric interpolation, 176 
convergence, 180 
existence, 178, 179 
fast Fourier transform, 181 
Trigonometric polynomials, 176
Triple recursion relation, 39 
Truncation error, 20, 342, 357 
Two-norm, 208 


Undetermined coefficients, method of, 317 

Uniform approximation, see Approximation of 
functions 

Uniform norm, 200 

Unitary matrix, 469, 499(P13). See also 
Orthogonal transformations 

Unit roundoff error, 15 

Unstable problems, 34. See also Ill-conditioned 
problems 


Vandermonde matrix, 132, 185(P1) 
Vector computer, 571 
Vector norm, 10, 481
continuity of, 482 
equivalence, 483 
maximum, 200, 481 
p-norm, 481 
Vectors, 463 
angle between, 469 
biorthogonal, 597 
convergence, 483 
dependence, 464 
independence, 464 
norm *, 481. See also Vector norm 
orthogonal, 469 
Vector space, 463 
basis, 464 
dimension, 464 
inner product, 467 
orthogonal basis, 469 


Weak stability, 365 

Weierstrass theorem, 198 

Weight function, 206, 251 

Weights, 250 

Well-posed problem, 34. See also Stability 
Wielandt-Hoffman theorem, 595 


Zeros, see Rootfinding 


